Airflow task callbacks are missed sometimes - airflow

Aiflow: 2.1.2 -
Executor: KubernetesExecutor -
Python: 3.7
I have written tasks using Airflow 2+ TaskFlow API and running the Airflow application in KubernetesExecutor mode. There are success and failure callbacks on the task but sometimes they get missed.
I've tried to specify the callbacks both via default_args on DAG and directly in the task decorator but seeing same behaviour.
#task(
on_success_callback=common.on_success_callback,
on_failure_callback=common.on_failure_callback,
)
def delta_load_pstn(files):
# doing something here
Here are the closing logs of the task
2022-04-26 11:21:38,494] Marking task as SUCCESS. dag_id=delta_load_pstn, task_id=dq_process, execution_date=20220426T112104, start_date=20220426T112131, end_date=20220426T112138
[2022-04-26 11:21:38,548] 1 downstream tasks scheduled from follow-on schedule check
[2022-04-26 11:21:42,069] State of this instance has been externally set to success. Terminating instance.
[2022-04-26 11:21:42,070] Sending Signals.SIGTERM to GPID 34
[2022-04-26 11:22:42,081] process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='sleeping', started='11:21:31') did not respond to SIGTERM. Trying SIGKILL
[2022-04-26 11:22:42,095] Process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:21:31') (34) terminated with exit code Negsignal.SIGKILL
[2022-04-26 11:22:42,095] Job 500 was killed before it finished (likely due to running out of memory)
And i can see in the task instance details that the callbacks are configured.
If I implement the on_execute_callback which is called before the execution of the task, I do get the alert (in Slack). So my guess is it's definitely something with killing the pod before the callback is handled.

Related

Airflow (MWAA) tasks receiving SIGTERM but task is externally set to success

We face a lot of our Airflow (MWAA) tasks receiving SIGTERM:
[2022-10-06 06:23:48,347] {{logging_mixin.py:104}} INFO - [2022-10-06 06:23:48,347] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to success. Terminating instance.
[2022-10-06 06:23:48,348] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2740
[2022-10-06 06:23:55,113] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-06 06:23:55,164] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2740, status='terminated', exitcode=1, started='06:23:42') (2740) terminated with exit code 1
It happens to a few of our tasks and it would not have been a big deal if the tasks were not set as a SUCCESS:
State of this instance has been externally set to success. Terminating instance
We understood that this can happen because of a lack of memory within the worker. We tried to increase the number of workers without any success. What would be our solutions to avoid having set tasks externally killed?
When tasks are getting killed, they are marked as failed. Here it seems to be the other way around. The task seem to get marked by something/someone as a success, after which the job is stopped/killed.
I am not aware of how Mwaa is deployed, but I would have a look at the action logging to see what/who is marking these tasks as success.

Airflow tasks ending up in Retry state without logs

Hi I'm currently running airflow on a Dataproc cluster. My DAGs used to run fine but facing this issue where tasks are ending up in 'retry' state without any logs when I click on task instance -> logs on airflow UI
I see the following error in terminal where I started the airflow webserver
2022-06-24 07:30:36.544 [ERROR] Executor reports task instance
<TaskInstance: **task name** 2022-06-23 07:00:00+00:00 [queued]> finished (failed)
although the task says its queued. Was the task killed externally?
None
[2022-06-23 06:08:33,202] {models.py:1758} INFO - Marking task as UP_FOR_RETRY
2022-06-23 06:08:33.202 [INFO] Marking task as UP_FOR_RETRY
What I tried so far
restarted webserver
Started server from 3 different ports
re-ran backfill command with 3 different timestamps
deleted dag runs for my dag, created a new dag run and then re-ran backfill command
cleared the PID as mentioned here How do I restart airflow webserver? and restarted the webserver
None of these worked. This issue is persistent for the past two days, appreciate any help here.At this point I'm guessing this is to do with a shared DB but not sure how to fix this.
<<update>> So what I also found is these tasks eventually go to success or failure state. when that happens the logs are available, but still no logs for the retry attempts in $airflow_home or our remote directory
The issue was there was another celery worker listening on the same queue. since this second worker was not configured properly it was failing the task and not writing the logs to remote location.

DataprocSubmitJobOperator Fails Intermittent With Zombie

We are using Airflow as orchestrator where it schedule workflow every hour. DataprocSubmitJobOperator is configured to schedule dataproc jobs (it uses spark). Spark sync data from source to target (runs for 50 min and then completes to avoid next schedule overlap).
Intermittent Airflow task fails due to zombie Exception. Logs show assertion failure due to pthread_mutex_lock(mu). Airflow Task exits. Underlying dataproc Job keeps running without issue.
Please suggest what can be potential issue/fix?
[2021-12-22 23:01:17,150] {dataproc.py:1890} INFO - Submitting job
[2021-12-22 23:01:17,804] {dataproc.py:1902} INFO - Job 27a2c88d-1308-4407-b965-aa490e2217fb submitted successfully.
[2021-12-22 23:01:17,805] {dataproc.py:1905} INFO - Waiting for job 27a2c88d-1308-4407-b965-aa490e2217fb to complete
E1222 23:45:58.299007027 1267 sync_posix.cc:67] assertion failed: pthread_mutex_lock(mu) == 0
[2021-12-22 23:46:00,943] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGABRT
Config
raw_data_sync = DataprocSubmitJobOperator(
task_id="raw_data_sync",
job=RAW_DATA_GENERATION,
location='us-central1',
project_id='1f780b38bd7b0384e53292de20',
execution_timeout=timedelta(seconds=3420),
dag=dag
)

Airflow 1.9.0 is queuing but tasks are not running

Airflow stopped running tasks all of a sudden. Below are all running
airflow scheduler
airflow webserver
airflow worker
webui message
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
Scheduler seems to be in a loop, keeps repeating the below messages. WebUI shows tasks are in queued state. Tried restarting the scheduler, didn't help.
[2018-11-17 22:03:45,809] {{jobs.py:1607}} DEBUG - Starting Loop...
[2018-11-17 22:03:45,809] {{jobs.py:1627}} INFO - Heartbeating the process manager
[2018-11-17 22:03:45,810] {{jobs.py:1662}} INFO - Heartbeating the executor
[2018-11-17 22:03:45,810] {{base_executor.py:103}} DEBUG - 124 running task instances
[2018-11-17 22:03:45,810] {{base_executor.py:104}} DEBUG - 0 in queue
[2018-11-17 22:03:45,810] {{base_executor.py:105}} DEBUG - 76 open slots
[2018-11-17 22:03:45,810] {{base_executor.py:132}} DEBUG - Calling the <class 'airflow.executors.celery_executor.CeleryExecutor'> sync method
[2018-11-17 22:03:45,810] {{celery_executor.py:80}} DEBUG - Inquiring about 124 celery task(s)
Airflow setup:
apache-airflow[celery, redis, all]==1.9.0
I also checked these posts but didn't help me:
Airflow 1.9.0 is queuing but not launching tasks
Airflow tasks get stuck at "queued" status and never gets running
Problem solved. This is a problem when you create your build on or after 2018-11-15 Turns out apache-airflow[celery, redis, all]==1.9.0 takes the latest version of redis-py 3.0.1 which does not work with celery 4.2.1.
Solution is to use redis-py 2.10.6
redis==2.10.6
apache-airflow[celery, all]==1.9.0

airflow - all tasks being queued and not moved to execution

airflow 1.8.1
Scheduler, worker and webserver are running in separate dockers on AWS.
The system was operational, and now for some reason all tasks are staying in queued state...
No errors in scheduler logs.
In worker I see this error (not sure if its related since scheduler should move tasks from queued state):
[2018-01-23 20:46:00,428] {base_task_runner.py:95} INFO - Subtask: [2018-01-23 20:46:00,428] {models.py:1122} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
I tried reboots, airflow clear and then resetdb commands but it did not help.
Any idea what else can be done to fix that problem?
Thanks

Resources