Airflow suddenly stopped running tasks. The following services are all running:
airflow scheduler
airflow webserver
airflow worker
Web UI message:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
The scheduler seems to be stuck in a loop and keeps repeating the messages below. The web UI shows the tasks in the queued state. Restarting the scheduler didn't help.
[2018-11-17 22:03:45,809] {{jobs.py:1607}} DEBUG - Starting Loop...
[2018-11-17 22:03:45,809] {{jobs.py:1627}} INFO - Heartbeating the process manager
[2018-11-17 22:03:45,810] {{jobs.py:1662}} INFO - Heartbeating the executor
[2018-11-17 22:03:45,810] {{base_executor.py:103}} DEBUG - 124 running task instances
[2018-11-17 22:03:45,810] {{base_executor.py:104}} DEBUG - 0 in queue
[2018-11-17 22:03:45,810] {{base_executor.py:105}} DEBUG - 76 open slots
[2018-11-17 22:03:45,810] {{base_executor.py:132}} DEBUG - Calling the <class 'airflow.executors.celery_executor.CeleryExecutor'> sync method
[2018-11-17 22:03:45,810] {{celery_executor.py:80}} DEBUG - Inquiring about 124 celery task(s)
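The counters in the executor's DEBUG lines above can be pulled out with a small parser, which makes it easier to compare one scheduler loop with the next. This is only a sketch and not part of Airflow: the function name `summarize_executor_heartbeat` is hypothetical, and the regular expressions assume the exact message wording shown in the log excerpt.

```python
import re

# Patterns match the executor DEBUG messages quoted in the log excerpt.
# They are assumptions based on that wording, not a stable Airflow interface.
PATTERNS = {
    "running": re.compile(r"(\d+) running task instances"),
    "queued": re.compile(r"(\d+) in queue"),
    "open_slots": re.compile(r"(\d+) open slots"),
}


def summarize_executor_heartbeat(lines):
    """Return the counters found in one scheduler heartbeat as a dict."""
    summary = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[key] = int(match.group(1))
    return summary


# Sample lines copied from the log excerpt in the question.
sample = [
    "[2018-11-17 22:03:45,810] {{base_executor.py:103}} DEBUG - 124 running task instances",
    "[2018-11-17 22:03:45,810] {{base_executor.py:104}} DEBUG - 0 in queue",
    "[2018-11-17 22:03:45,810] {{base_executor.py:105}} DEBUG - 76 open slots",
]
summary = summarize_executor_heartbeat(sample)
print(summary)
```

If the running count stays the same across many loops while the web UI still shows those tasks as queued, that would suggest the executor is waiting on Celery task states that never change, which matches the symptom described here.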
Airflow setup:
apache-airflow[celery, redis, all]==1.9.0
I also checked these posts, but they didn't help:
Airflow 1.9.0 is queuing but not launching tasks
Airflow tasks get stuck at "queued" status and never gets running
Problem solved. This affects any build created on or after 2018-11-15. It turns out that apache-airflow[celery, redis, all]==1.9.0 pulls in the latest redis-py, 3.0.1, which does not work with celery 4.2.1.
The solution is to pin redis-py to 2.10.6:
redis==2.10.6
apache-airflow[celery, all]==1.9.0
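A quick way to confirm whether an image picked up the newer redis-py is to compare its version string against the 3.0 boundary. The sketch below assumes that every 3.x release is affected, while this answer only confirms 3.0.1 with celery 4.2.1; the helper names `parse_version` and `redis_needs_pin` are hypothetical.

```python
import re

KNOWN_GOOD_REDIS = "2.10.6"  # the pin recommended in the answer above


def parse_version(version):
    """Turn '2.10.6' into (2, 10, 6), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def redis_needs_pin(redis_version):
    """Return True when redis-py is 3.0 or newer.

    Assumption: all 3.x releases are treated as incompatible, although the
    answer only confirms the problem for 3.0.1 with celery 4.2.1.
    """
    return parse_version(redis_version) >= (3,)


# In a real environment the string could come from `redis.__version__`.
for candidate in (KNOWN_GOOD_REDIS, "3.0.1"):
    print(candidate, "needs pin:", redis_needs_pin(candidate))
```

Running a check like this at image build time would surface the problem before the scheduler starts queuing tasks that no worker picks up.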
Hi, I'm currently running Airflow on a Dataproc cluster. My DAGs used to run fine, but I'm now facing an issue where tasks end up in the 'retry' state without any logs when I click on task instance -> logs in the Airflow UI.
I see the following error in the terminal where I started the airflow webserver:
2022-06-24 07:30:36.544 [ERROR] Executor reports task instance
<TaskInstance: **task name** 2022-06-23 07:00:00+00:00 [queued]> finished (failed)
although the task says its queued. Was the task killed externally?
None
[2022-06-23 06:08:33,202] {models.py:1758} INFO - Marking task as UP_FOR_RETRY
2022-06-23 06:08:33.202 [INFO] Marking task as UP_FOR_RETRY
What I tried so far
restarted webserver
Started server from 3 different ports
re-ran backfill command with 3 different timestamps
deleted dag runs for my dag, created a new dag run and then re-ran backfill command
cleared the PID as described in "How do I restart airflow webserver?" and restarted the webserver
None of these worked. This issue has persisted for the past two days, so I'd appreciate any help. At this point I'm guessing it has to do with a shared DB, but I'm not sure how to fix it.
Update: I also found that these tasks eventually reach the success or failure state. When that happens the logs are available, but there are still no logs for the retry attempts in $airflow_home or our remote directory.
The issue was that another celery worker was listening on the same queue. Since this second worker was not configured properly, it was failing the task and not writing the logs to the remote location.
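One way to spot a stray worker like this is to group workers by the queues they consume. The sketch below takes a mapping in the shape returned by Celery's `app.control.inspect().active_queues()`, that is `{worker_name: [{"name": queue_name, ...}, ...]}`, and lists the queues that have more than one consumer. The function name and the sample worker names are hypothetical, and several workers on one queue is normal in many setups, so the output is a list of places to look rather than a list of errors.

```python
from collections import defaultdict


def queues_with_multiple_workers(active_queues):
    """Map each queue name to its workers, keeping queues with 2+ workers.

    `active_queues` is expected to look like the result of
    app.control.inspect().active_queues(); None (no replies) is treated
    as an empty mapping.
    """
    workers_by_queue = defaultdict(list)
    for worker, queues in (active_queues or {}).items():
        for queue in queues:
            workers_by_queue[queue["name"]].append(worker)
    return {
        name: sorted(workers)
        for name, workers in workers_by_queue.items()
        if len(workers) > 1
    }


# Hypothetical sample data; real worker and queue names will differ.
sample = {
    "celery@worker-a": [{"name": "default"}],
    "celery@stray-worker": [{"name": "default"}],
    "celery@worker-b": [{"name": "reports"}],
}
shared = queues_with_multiple_workers(sample)
print(shared)
```

Any worker in the result that you do not recognize is a candidate for the misconfigured consumer described above.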
Airflow: 2.1.2
Executor: KubernetesExecutor
Python: 3.7
I have written tasks using the Airflow 2+ TaskFlow API and am running Airflow with the KubernetesExecutor. There are success and failure callbacks on the task, but sometimes they are missed.
I've tried specifying the callbacks both via default_args on the DAG and directly in the task decorator, but I see the same behaviour.
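from airflow.decorators import task

# `common` is the author's own module that holds the callback functions.
@task(
    on_success_callback=common.on_success_callback,
    on_failure_callback=common.on_failure_callback,
)
def delta_load_pstn(files):
    # doing something here
    ...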
Here are the closing logs of the task
[2022-04-26 11:21:38,494] Marking task as SUCCESS. dag_id=delta_load_pstn, task_id=dq_process, execution_date=20220426T112104, start_date=20220426T112131, end_date=20220426T112138
[2022-04-26 11:21:38,548] 1 downstream tasks scheduled from follow-on schedule check
[2022-04-26 11:21:42,069] State of this instance has been externally set to success. Terminating instance.
[2022-04-26 11:21:42,070] Sending Signals.SIGTERM to GPID 34
[2022-04-26 11:22:42,081] process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='sleeping', started='11:21:31') did not respond to SIGTERM. Trying SIGKILL
[2022-04-26 11:22:42,095] Process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:21:31') (34) terminated with exit code Negsignal.SIGKILL
[2022-04-26 11:22:42,095] Job 500 was killed before it finished (likely due to running out of memory)
I can also see in the task instance details that the callbacks are configured.
If I implement the on_execute_callback which is called before the execution of the task, I do get the alert (in Slack). So my guess is it's definitely something with killing the pod before the callback is handled.
We are using Airflow as an orchestrator that schedules a workflow every hour. DataprocSubmitJobOperator is configured to submit Dataproc jobs (which use Spark). The Spark job syncs data from source to target (it runs for 50 minutes and then completes, to avoid overlapping with the next schedule).
Intermittently, the Airflow task fails with a zombie exception. The logs show an assertion failure on pthread_mutex_lock(mu), and the Airflow task exits. The underlying Dataproc job keeps running without issue.
What could be the potential issue or fix?
[2021-12-22 23:01:17,150] {dataproc.py:1890} INFO - Submitting job
[2021-12-22 23:01:17,804] {dataproc.py:1902} INFO - Job 27a2c88d-1308-4407-b965-aa490e2217fb submitted successfully.
[2021-12-22 23:01:17,805] {dataproc.py:1905} INFO - Waiting for job 27a2c88d-1308-4407-b965-aa490e2217fb to complete
E1222 23:45:58.299007027 1267 sync_posix.cc:67] assertion failed: pthread_mutex_lock(mu) == 0
[2021-12-22 23:46:00,943] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGABRT
Config
raw_data_sync = DataprocSubmitJobOperator(
    task_id="raw_data_sync",
    job=RAW_DATA_GENERATION,
    location='us-central1',
    project_id='1f780b38bd7b0384e53292de20',
    execution_timeout=timedelta(seconds=3420),
    dag=dag
)
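For reference, the timing in this setup can be checked with a little arithmetic. The sketch below uses only values quoted in the question (the hourly schedule, the 50-minute Spark runtime, the 3420-second timeout, and the log timestamps rounded to whole seconds); the variable names are illustrative.

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(hours=1)       # hourly schedule from the description
spark_runtime = timedelta(minutes=50)        # runtime stated in the description
execution_timeout = timedelta(seconds=3420)  # value from the operator config

# Timestamps taken from the log excerpt.
submitted_at = datetime(2021, 12, 22, 23, 1, 17)  # "Submitting job"
exited_at = datetime(2021, 12, 22, 23, 46, 0)     # "Task exited with return code"

timeout_minutes = execution_timeout.total_seconds() / 60
headroom = execution_timeout - spark_runtime
slack_before_next_run = schedule_interval - execution_timeout
elapsed = exited_at - submitted_at

print("timeout in minutes:", timeout_minutes)
print("headroom over Spark runtime:", headroom)
print("slack before next run:", slack_before_next_run)
print("elapsed before exit:", elapsed)
print("exit happened before the timeout:", elapsed < execution_timeout)
```

The timeout works out to 57 minutes, and the task exited roughly 45 minutes after submission, so the SIGABRT in the log happened before execution_timeout could have fired.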
I have an issue where a task in my DAG never gets picked up by the workers.
When I look at the task details:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
I checked the scheduler and found no errors in the log. I also restarted it a few times.
I also checked the airflow webserver log and only noticed this:
22/11/2018 12:10:39[2018-11-22 01:10:39,747] {{cli.py:644}} DEBUG - [5 / 5] killing 1 workers
22/11/2018 12:10:39[2018-11-22 01:10:39 +0000] [43] [INFO] Handling signal: ttou
22/11/2018 12:10:39[2018-11-22 01:10:39 +0000] [348] [INFO] Worker exiting (pid: 348)
I'm not sure what is happening; it worked fine before.
The Airflow version is 1.9.0 and has never changed. I was only experimenting with some of the config options, min_file_process_interval and dag_dir_list_interval, and I put them back to their defaults when I hit this issue.
I did notice that this happens after I change some of the Airflow config and rebuild our Airflow Docker image. When I reverted to the original image, which used to work, the problem went away.
I also noticed an error (not always captured) in my celery workers when I use the newly built image:
Unrecoverable error: AttributeError("'float' object has no attribute 'items'",)
So I found that it is related to the latest redis release (Celery uses redis here).
airflow 1.8.1
Scheduler, worker and webserver are running in separate dockers on AWS.
The system was operational, and now for some reason all tasks are staying in queued state...
No errors in scheduler logs.
In the worker I see this error (not sure if it's related, since the scheduler should move tasks out of the queued state):
[2018-01-23 20:46:00,428] {base_task_runner.py:95} INFO - Subtask: [2018-01-23 20:46:00,428] {models.py:1122} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
I tried reboots, then the airflow clear and resetdb commands, but none of them helped.
Any idea what else can be done to fix that problem?
Thanks