Airflow Exception - Task received SIGTERM signal - airflow

I am running airflow tasks using SSH operator. I am pretty sure that the python program has no error and runs successfully when i run it. But when run from airflow towards the end of program execution I end up with SIGTERM error.
I tried to figure out by looking into various solutions but nothing worked. I tried increasing
killed_task_cleanup_time = 1200 from 60 in airflow.cfg file. Also tried changing hostname_callable to socket:gethostname in airflow.cfg as I received the following warning before this error
Warning: The recorded hostname xxx does not match this instance's hostname
Error:
[2020-10-15 10:45:34,937] {taskinstance.py:954} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-10-15 10:45:34,959] {taskinstance.py:1145} ERROR - SSH operator error: Task received SIGTERM signal
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/airflow/contrib/operators/ssh_operator.py", line 137, in execute
readq, _, _ = select([channel], [], [], self.timeout)
File "/opt/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 956, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Any ideas and suggestions are teally helpful. Stuck with this for a day now

This problem is triggered by the fact that the RECORDED hostname XXX maps an IP address that is different from the IP address mapped by instance's hostname, throwing a SIGTERM error. So you need to specify the IP mapping for the recorded Hostname XXX

Possibly this thread might help? https://issues.apache.org/jira/browse/AIRFLOW-966.
Which version of airflow are you using, and did you check your celery broker settings?
The solution seems to be setting visibility timeout higher than the celery default, which is 1 hour, to prevent celery from re-submitting the job. I believe this only affects tasks created via manual run / CLI (not normally scheduled tasks.)

Related

Airflow (MWAA) tasks receiving SIGTERM but task is externally set to success

We face a lot of our Airflow (MWAA) tasks receiving SIGTERM:
[2022-10-06 06:23:48,347] {{logging_mixin.py:104}} INFO - [2022-10-06 06:23:48,347] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to success. Terminating instance.
[2022-10-06 06:23:48,348] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2740
[2022-10-06 06:23:55,113] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-06 06:23:55,164] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2740, status='terminated', exitcode=1, started='06:23:42') (2740) terminated with exit code 1
It happens to a few of our tasks and it would not have been a big deal if the tasks were not set as a SUCCESS:
State of this instance has been externally set to success. Terminating instance
We understood that this can happen because of a lack of memory within the worker. We tried to increase the number of workers without any success. What would be our solutions to avoid having set tasks externally killed?
When tasks are getting killed, they are marked as failed. Here it seems to be the other way around. The task seem to get marked by something/someone as a success, after which the job is stopped/killed.
I am not aware of how Mwaa is deployed, but I would have a look at the action logging to see what/who is marking these tasks as success.

Airflow Papermill operator: task externally skipped after 60 minutes

I am using Airflow in a Docker container. I run a DAG with multiple Jupyter notebooks. I have the following error everytime after 60 minutes:
[2021-08-22 09:15:15,650] {local_task_job.py:198} WARNING - State of this instance has been externally set to skipped. Terminating instance.
[2021-08-22 09:15:15,654] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 277
[2021-08-22 09:15:15,655] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-08-22 09:15:18,284] {taskinstance.py:1501} ERROR - Task failed with exception
I tried to tweak the config file but could not find the good option to remove the 1 hour timeout.
Any help would be appreciated.
The default is no timeout. When your DAG defines dagrun_timeout=timedelta(minutes=60) and execution time exceeds 60 minutes then active task stops with message "State of this instance has been externally set to skipped" logged.

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via slurm. The job ran for 12 hours and was working as expected. Then I got Data unpack would read past end of buffer in file util/show_help.c at line 501. It is usual for me to get errors like ORTE has lost communication with a remote daemon but I usually get this in the beginning of the job. It is annoying but still does not cause as much time loss as getting error after 12 hours. Is there a quick fix for this? Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------

apache airflow 1.10.9 statsd enabled making scheduler crashed

my airflow running in CeleryExecutor mode + progresql 12, all things go well except when turning statsd on:
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
The schedulers can render jobs but jobs are not running, the scheduler log having below error:
[SQL: SELECT count(*) AS count_1
FROM task_instance
WHERE task_instance.pool = %(pool_1)s AND task_instance.state IN (%(state_1)s, %(state_2)s)]
[parameters: {'pool_1': 'default_pool', 'state_1': 'running', 'state_2': 'queued'}]
(Background on this error at: http://sqlalche.me/e/4xp6)[0m
[31mTraceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.ProtocolViolation: invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1495, in _validate_and_run_task_instances
self._process_and_execute_tasks(simple_dag_bag)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.errors.ProtocolViolation) invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
If disable statsd, everything resume. Is it a bug for airflow? any advise to resolve it?
I faced the same error, and after a few tests, i can get statsd metrics working. Typically, you will see the error if the following conditions are met.
Statsd enabled set to True
SqlAlchemy connection pool set to True
Scheduler syserr log enabled (by redirect the err log to a file where you can see this error)
In my case, even though the scheduler kept throwing the error logs, statsd metrics were still delivered, and tasks were also scheduled as they should. I dont know how to measure the impact, i also dont want to sacrifice sql_alchemy connection pool, so I leave statsd turned off.
(I guess other people not seeing the error because they are missing the 3rd one above)

AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname

I'm running Airflow on a clustered environment running on two AWS EC2-Instances. One for master and one for the worker. The worker node though periodically throws this error when running "$airflow worker":
[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
File "/usr/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 387, in run
run_job.run()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 198, in run
self._execute()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2527, in _execute
self.heartbeat()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 182, in heartbeat
self.heartbeat_callback(session=session)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2575, in heartbeat_callback
raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2018-08-09 16:15:43,671] {celery_executor.py:54} ERROR - Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
[2018-08-09 16:15:43,681: ERROR/ForkPoolWorker-30] Task airflow.executors.celery_executor.execute_command[875a4da9-582e-4c10-92aa-5407f3b46d5f] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
subprocess.check_call(command, shell=True)
File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
When this error occurs the task is marked as failed on Airflow and thus fails my DAG when nothing actually went wrong in the task.
I'm using Redis as my queue and postgreSQL as my meta-database. Both are external as AWS services. I'm running all of this on my company environment which is why the full name of the server is ip-1.2.3.4.eco.tanonprod.comanyname.io. It looks like it wants this full name somewhere but I have no idea where I need to fix this value so that it's getting ip-1.2.3.4.eco.tanonprod.comanyname.io instead of just ip-1.2.3.4.
The really weird thing about this issue is that it doesn't always happen. It seems to just randomly happen every once in a while when I run the DAG. It's also occurring on all of my DAGs sporadically so it's not just one DAG. I find it strange though how it's sporadic because that means other task runs are handling the IP address for whatever this is just fine.
Note: I've changed the real IP address to 1.2.3.4 for privacy reasons.
Answer:
https://github.com/apache/incubator-airflow/pull/2484
This is exactly the problem I am having and other Airflow users on AWS EC2-Instances are experiencing it as well.
The hostname is set when the task instance runs, and is set to self.hostname = socket.getfqdn(), where socket is the python package import socket.
The comparison that triggers this error is:
fqdn = socket.getfqdn()
if fqdn != ti.hostname:
logging.warning("The recorded hostname {ti.hostname} "
"does not match this instance's hostname "
"{fqdn}".format(**locals()))
raise AirflowException("Hostname of job runner does not match")
It seems like the hostname on the ec2 instance is changing on you while the worker is running. Perhaps try manually setting the hostname as described here https://forums.aws.amazon.com/thread.jspa?threadID=246906 and see if that sticks.
I had a similar problem on my Mac. It fixed it setting hostname_callable = socket:gethostname in airflow.cfg.
Personally when running on my Mac, I found that I got similar errors to this when the Mac would sleep while I was running a long job. The solution was to go into System Preferences -> Energy Saver and then check "Prevent computer from sleeping automatically when the display is off."

Resources