I'm hitting an issue where a few of my tasks are timing out. When the task reruns, or when I retry it, it works fine.
Is there a way to increase the timeout, either in the config or within the DAG itself, to avoid these errors?
ERROR - Process timed out
ERROR - Failed to import: /home/ec2-user/airflow/dags/dse/mydag.py
airflow.exceptions.AirflowTaskTimeout: Timeout
airflow.exceptions.AirflowException: dag_id could not be found: mydag. Either the dag did not exist or it failed to parse.
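The "Failed to import" line next to AirflowTaskTimeout suggests that the timeout fired while the DAG file was being parsed, not while the task's own work was running. Airflow limits parsing with the dagbag_import_timeout setting in the [core] section of airflow.cfg (30 seconds by default), and that value can be raised. A minimal sketch for checking how long the file takes to import, using only the standard library; the helper names time_dag_import and exceeds_import_timeout are made up for illustration and are not part of Airflow:

```python
import importlib.util
import time
from pathlib import Path

# Airflow's documented default for [core] dagbag_import_timeout, in seconds.
DEFAULT_DAGBAG_IMPORT_TIMEOUT = 30.0


def time_dag_import(path):
    """Import the Python file at `path` once and return the elapsed seconds.

    Hypothetical helper: it executes the module's top-level code the way a
    plain import would, so slow top-level work (DB calls, HTTP requests,
    heavy imports) shows up in the measurement.
    """
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)
    return time.perf_counter() - start


def exceeds_import_timeout(elapsed, timeout=DEFAULT_DAGBAG_IMPORT_TIMEOUT):
    """Return True when a measured import time would hit the parse timeout."""
    return elapsed >= timeout
```

Running time_dag_import("/home/ec2-user/airflow/dags/dse/mydag.py") on the worker host gives a rough figure to compare against the configured timeout. If the import is slow only some of the time, moving expensive calls out of the top level of the DAG file is usually a better fix than raising the limit.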
Related
Many of our Airflow (MWAA) tasks are receiving SIGTERM:
[2022-10-06 06:23:48,347] {{logging_mixin.py:104}} INFO - [2022-10-06 06:23:48,347] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to success. Terminating instance.
[2022-10-06 06:23:48,348] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2740
[2022-10-06 06:23:55,113] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-06 06:23:55,164] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2740, status='terminated', exitcode=1, started='06:23:42') (2740) terminated with exit code 1
It happens to a few of our tasks, and it would not be a big deal if the tasks were not then marked as SUCCESS:
State of this instance has been externally set to success. Terminating instance
We understand that this can happen because of a lack of memory on the worker. We tried increasing the number of workers, without success. What can we do to keep tasks from being killed externally?
When tasks get killed, they are normally marked as failed. Here it seems to be the other way around: the task seems to be marked as success by something or someone, after which the job is stopped/killed.
I am not aware of how MWAA is deployed, but I would look at the action logging to see what or who is marking these tasks as success.
We are using Airflow as an orchestrator that schedules a workflow every hour. DataprocSubmitJobOperator is configured to submit Dataproc jobs (which use Spark). The Spark job syncs data from source to target (it runs for 50 minutes and then completes, to avoid overlapping with the next schedule).
Intermittently, the Airflow task fails with a zombie exception. The logs show an assertion failure on pthread_mutex_lock(mu), and the Airflow task exits. The underlying Dataproc job keeps running without issue.
What could be the cause, and is there a fix?
[2021-12-22 23:01:17,150] {dataproc.py:1890} INFO - Submitting job
[2021-12-22 23:01:17,804] {dataproc.py:1902} INFO - Job 27a2c88d-1308-4407-b965-aa490e2217fb submitted successfully.
[2021-12-22 23:01:17,805] {dataproc.py:1905} INFO - Waiting for job 27a2c88d-1308-4407-b965-aa490e2217fb to complete
E1222 23:45:58.299007027 1267 sync_posix.cc:67] assertion failed: pthread_mutex_lock(mu) == 0
[2021-12-22 23:46:00,943] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGABRT
Config
raw_data_sync = DataprocSubmitJobOperator(
    task_id="raw_data_sync",
    job=RAW_DATA_GENERATION,
    location='us-central1',
    project_id='1f780b38bd7b0384e53292de20',
    execution_timeout=timedelta(seconds=3420),
    dag=dag,
)
I am using Airflow in a Docker container. I run a DAG with multiple Jupyter notebooks. I get the following error every time after 60 minutes:
[2021-08-22 09:15:15,650] {local_task_job.py:198} WARNING - State of this instance has been externally set to skipped. Terminating instance.
[2021-08-22 09:15:15,654] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 277
[2021-08-22 09:15:15,655] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-08-22 09:15:18,284] {taskinstance.py:1501} ERROR - Task failed with exception
I tried to tweak the config file but could not find the right option to remove the 1-hour timeout.
Any help would be appreciated.
By default there is no timeout. If your DAG defines dagrun_timeout=timedelta(minutes=60) and the run takes longer than 60 minutes, the active task is stopped and the message "State of this instance has been externally set to skipped" is logged.
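As a rough sketch of where those limits live, assuming Airflow 2.x and a PythonOperator task: dagrun_timeout is set on the DAG, execution_timeout on the individual task. The DAG id, task id, durations, and the run_notebook placeholder below are all made up for illustration, not taken from the question.

```python
from datetime import datetime, timedelta

# Hypothetical values: choose limits longer than the longest expected run.
DAG_KWARGS = {
    "dag_id": "notebooks_example",  # hypothetical name
    "start_date": datetime(2021, 8, 1),
    "catchup": False,
    # Limit for the whole DAG run; leaving it out (None) means no limit.
    "dagrun_timeout": timedelta(hours=3),
}

TASK_KWARGS = {
    "task_id": "run_notebook",  # hypothetical name
    # Limit for this single task.
    "execution_timeout": timedelta(hours=2),
}


def run_notebook():
    """Placeholder standing in for the real notebook execution."""
    return "done"


def build_dag():
    """Build the DAG from the settings above (assumes Airflow 2.x)."""
    # Imported here so the settings can be inspected without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(**DAG_KWARGS)
    PythonOperator(python_callable=run_notebook, dag=dag, **TASK_KWARGS)
    return dag
```

In an actual DAG file the result has to be bound at module level, for example dag = build_dag(), so that the scheduler can discover it.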
Airflow 1.8.1
The scheduler, worker and webserver are running in separate Docker containers on AWS.
The system was operational, and now for some reason all tasks are staying in queued state...
No errors in scheduler logs.
In the worker I see this error (not sure if it's related, since the scheduler should be moving tasks out of the queued state):
[2018-01-23 20:46:00,428] {base_task_runner.py:95} INFO - Subtask: [2018-01-23 20:46:00,428] {models.py:1122} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
I tried reboots, airflow clear, and then resetdb, but none of it helped.
Any idea what else can be done to fix that problem?
Thanks
I am facing a bug where a task marks itself as failed after exhausting its maximum number of retries, then restarts itself and marks itself as running. This interferes with the next run of the same task, as the two sometimes run in parallel.