I'm running Airflow with Docker Swarm on 5 servers. After about two months of use, some DAGs started failing with errors like the ones below. The errors occur in DAGs that use a custom Hive operator (similar to the inner function), and no errors occurred during the first two months (nothing in the DAGs changed...).
Also, when I retry the DAG, it sometimes succeeds and sometimes fails again.
The really weird thing about this issue is that the Hive job itself did not fail: after the task was marked as failed in the Airflow webserver (SIGTERM), the query still completed 1~10 minutes later.
As a result, the flow looks like this:
Task start -> 5~10 mins -> error (SIGTERM, Airflow) -> 1~10 mins -> Hive job success (Hadoop log)
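For context, the operator's execute() is essentially a thin wrapper around a PyHive cursor. The following is only a minimal sketch: the real hive_q_operator.py is not shown in the question, so the class name, constructor arguments, and connection details are assumptions inferred from the traceback below.

from airflow.models import BaseOperator
from pyhive import hive

class HiveQOperator(BaseOperator):  # hypothetical name, mirroring hive_q_operator.py
    def __init__(self, hql, hive_conn_params=None, **kwargs):
        super().__init__(**kwargs)
        self.hql = hql
        self.hive_conn_params = hive_conn_params or {}

    def execute(self, context):
        # Open a HiveServer2 connection and run the statement synchronously.
        # This blocking cur.execute() call is what the SIGTERM interrupts below.
        conn = hive.connect(**self.hive_conn_params)
        try:
            cur = conn.cursor()
            cur.execute(self.hql)  # hive query custom operator
        finally:
            conn.close()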
[2023-01-09 08:06:07,583] {local_task_job.py:208} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2023-01-09 08:06:07,588] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 135213
[2023-01-09 08:06:07,588] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-01-09 08:13:42,510] {taskinstance.py:1463} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/dags/common/operator/hive_q_operator.py", line 81, in execute
cur.execute(statement) # hive query custom operator
File "/home/airflow/.local/lib/python3.8/site-packages/pyhive/hive.py", line 454, in execute
response = self._connection.client.ExecuteStatement(req)
File "/home/airflow/.local/lib/python3.8/site-packages/TCLIService/TCLIService.py", line 280, in ExecuteStatement
return self.recv_ExecuteStatement()
File "/home/airflow/.local/lib/python3.8/site-packages/TCLIService/TCLIService.py", line 292, in recv_ExecuteStatement
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 134, in readMessageBegin
sz = self.readI32()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 217, in readI32
buff = self.trans.readAll(4)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 62, in readAll
chunk = self.read(sz - have)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 173, in read
self._read_frame()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 177, in _read_frame
header = self._trans_read_all(4)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 210, in _trans_read_all
return read_all(sz)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 62, in readAll
chunk = self.read(sz - have)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TSocket.py", line 150, in read
buff = self.handle.recv(sz)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1238, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
I already restarted the Airflow services and nothing changed.
Here is the failed task's log from the Celery worker (Flower log).
Is there any guide that could help me?
Thanks :)
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 88, in execute_command
_execute_in_fork(command_to_exec)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 99, in _execute_in_fork
raise AirflowException('Celery command failed on host: ' + get_hostname())
airflow.exceptions.AirflowException: Celery command failed on host: 8be4caa25d17
Related
I got the following Flink exception when I run a PyFlink processing job:
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.6/threading.py", line 885, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib64/python3.6/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/usr/local/lib64/python3.6/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.CANCELLED
details = "Multiplexer hanging up"
debug_error_string = "{"created":"#1664983018.802636895","description":"Error received from peer ipv4:127.0.0.1:44675","file":"src/core/lib/surface/call.cc","file_line":904,"grpc_message":"Multiplexer hanging up","grpc_status":1}"
>
The sink result is loaded with:
resultB = tableA.flat_map(name).alias('name') \
    .select(col('name')) \
    .execute_insert('allowed_table').wait()

resultA = tableA.flat_map(name).alias('name') \
    .select(col('name')) \
    .execute_insert('allowed_table').wait()
I solved this problem by removing .wait()
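For reference, a sketch of the working version (same tables and column as in the question), with the blocking .wait() calls removed so each execute_insert() just submits its job and returns a TableResult:

# Same sink as before, but without .wait(): the inserts are submitted
# asynchronously instead of blocking the client until the job finishes.
resultB = tableA.flat_map(name).alias('name') \
    .select(col('name')) \
    .execute_insert('allowed_table')

resultA = tableA.flat_map(name).alias('name') \
    .select(col('name')) \
    .execute_insert('allowed_table')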
In the terminal:
ngrok config add-authtoken -----personal_TOKEN-----
When I try this, I get the error below.
I already installed pyngrok, but I still have this issue.
Thanks for any help!
Traceback (most recent call last):
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/installer.py", line 94, in install_ngrok
download_path = _download_file(url, **kwargs)
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/installer.py", line 257, in _download_file
raise e
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/installer.py", line 235, in _download_file
buffer = response.read(chunk_size)
File "/usr/lib/python3.8/http/client.py", line 459, in read
n = self.readinto(b)
File "/usr/lib/python3.8/http/client.py", line 503, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/selman/PycharmProjects/tringle-case/venv/bin/ngrok", line 8, in <module>
sys.exit(main())
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/ngrok.py", line 501, in main
run(sys.argv[1:])
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/ngrok.py", line 487, in run
install_ngrok(pyngrok_config)
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/ngrok.py", line 98, in install_ngrok
installer.install_ngrok(pyngrok_config.ngrok_path)
File "/home/selman/PycharmProjects/tringle-case/venv/lib/python3.8/site-packages/pyngrok/installer.py", line 98, in install_ngrok
raise PyngrokNgrokInstallError("An error occurred while downloading ngrok from {}: {}".format(url, e))
pyngrok.exception.PyngrokNgrokInstallError: An error occurred while downloading ngrok from https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip: The read operation timed out
The documentation is faulty. Try this:
ngrok.exe authtoken xxxxxxxxxxxxxxxxxxxxxxxxx
Replace xxxxxxx with your token.
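If you are driving ngrok from Python through pyngrok rather than the CLI, the equivalent call (only a sketch; it assumes a pyngrok version where ngrok.set_auth_token is available) is:

from pyngrok import ngrok

# Writes the token into the local ngrok config file, like the CLI command above.
ngrok.set_auth_token("-----personal_TOKEN-----")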
Say I have the following DAG (stuff omitted for clarity):
# dag.py
from airflow.operators.python import PythonOperator

def main():
    print("Task 1")
    # some code
    print("Task 2")
    # some more code
    print("Done")
    return 0

t1 = PythonOperator(python_callable=main)
t1
Say the program fails at # some more code due to, e.g., RAM issues. I just get an error in my log, e.g.:
[2021-05-25 12:49:54,211] {process_utils.py:137} INFO - Output:
[2021-05-25 12:52:44,605] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 493, in execute
super().execute(context=serializable_context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 531, in execute_callable
string_args_filename,
File "/usr/local/lib/python3.6/site-packages/airflow/utils/process_utils.py", line 145, in execute_in_subprocess
raise subprocess.CalledProcessError(exit_code, cmd)
subprocess.CalledProcessError: Command '['/tmp/venv2wbjnabi/bin/python', '/tmp/venv2wbjnabi/script.py', '/tmp/venv2wbjnabi/script.in', '/tmp/venv2wbjnabi/script.out', '/tmp/venv2wbjnabi/string_args.txt']' died with <Signals.SIGKILL: 9>.
[2021-05-25 13:00:55,733] {taskinstance.py:1532} INFO - Marking task as FAILED. dag_id=test_dag, task_id=clean_data, execution_date=20210525T105621, start_date=20210525T105732, end_date=20210525T110055
[2021-05-25 13:00:56,555] {local_task_job.py:146} INFO - Task exited with return code 1
but none of the print statements are printed, so I don't know where the program failed (I only know now because of debugging).
Because of that, I assume Airflow doesn't flush output before the task is marked as "success". Is there a way to make Airflow flush/print at runtime?
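For what it's worth, plain Python output buffering is the usual cause of this, so one thing to try is forcing a flush yourself. A minimal sketch (print(flush=True) and the PYTHONUNBUFFERED variable are standard CPython features; whether they are enough in your particular Airflow/virtualenv setup is an assumption):

import sys

def main():
    print("Task 1", flush=True)   # push the line out immediately
    # some code
    print("Task 2", flush=True)
    # some more code
    print("Done", flush=True)
    sys.stdout.flush()            # an explicit flush works the same way
    return 0

Setting PYTHONUNBUFFERED=1 in the worker's environment disables buffering for all prints without touching the code.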
I am running a pipeline in Airflow that contains multiple BashOperators.
Each operator has on_failure_callback and on_success_callback attributes that call a function to send an email with the status of the task (success/fail) and upload the generated log file from a local directory to HDFS. The following code snippets show a sample of the operator I am using and the callable function.
Bash Operator:
op = BashOperator(
    task_id='test_op',
    bash_command='python3 run.py',
    on_failure_callback=fail_email,
    on_success_callback=success_email,
    retries=3,
    dag=dag)
success_email:
import subprocess

from airflow.utils.email import send_email


def success_email(contextDict, **kwargs):
    """Send custom email alerts."""
    # email title
    title = "Task {} SUCCEEDED. Execution date: {}".format(
        contextDict['task'].task_id, contextDict['execution_date'])
    # email contents; hdfs_log, hdfs_log_folder and local_log are defined elsewhere
    body = """
    <br>
    The correspondent log file:
    <br>
    {}
    """.format(hdfs_log)
    print("Uploading log to hdfs")
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_log_folder])
    subprocess.check_call(["hdfs", "dfs", "-put", local_log, hdfs_log])
    send_email('email#domain.com', title, html_content=body)
The success_callback always fails when calling the hdfs commands and gives the following error:
[2018-12-28 09:13:29,727] INFO - Uploading log to hdfs
[2018-12-28 09:13:30,344] INFO - [2018-12-28 09:13:30,342] WARNING - State of this instance has been externally set to success. Taking the poison pill.
[2018-12-28 09:13:30,381] INFO - Sending Signals.SIGTERM to GPID 11515
[2018-12-28 09:13:30,382] ERROR - Received SIGTERM. Terminating subprocesses.
[2018-12-28 09:13:30,382] INFO - Sending SIGTERM signal to bash process group
[2018-12-28 09:13:30,390] ERROR - Failed when executing success callback
[2018-12-28 09:13:30,390] ERROR - [Errno 3] No such process
Traceback (most recent call last):
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/models.py", line 1687, in _run_raw_task
task.on_success_callback(context)
File "/usr/local/airflow/dags/Appl_FUMA.py", line 139, in success_email
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_log_folder])
File "/usr/lib64/python3.6/subprocess.py", line 286, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib64/python3.6/subprocess.py", line 269, in call
return p.wait(timeout=timeout)
File "/usr/lib64/python3.6/subprocess.py", line 1457, in wait
(pid, sts) = self._try_wait(0)
File "/usr/lib64/python3.6/subprocess.py", line 1404, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/models.py", line 1611, in signal_handler
task_copy.on_kill()
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/operators/bash_operator.py", line 125, in on_kill
os.killpg(os.getpgid(self.sp.pid), signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=11515 (terminated)) (11515) terminated with exit code 0
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=20649 (terminated)) (20649) terminated with exit code None
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=11530 (terminated)) (11530) terminated with exit code None
[2018-12-28 09:13:30,515] INFO - [2018-12-28 09:13:30,515] INFO - Task exited with return code 0
However, it manages to send the emails (sometimes) when I comment out the two subprocess lines.
Any idea how to fix this issue?
The Robot Framework log file shows the following:
INFO Executing command '/testcase_xy.py'.
DEBUG [chan 28] Max packet in: 32768 bytes
INFO Command exited with return code 0.
INFO Executing command '/testcase_xy_next.py'.
DEBUG [chan 43] Max packet in: 32768 bytes
FAIL EOFError
DEBUG Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/SSHLibrary/library.py", line 881, in execute_command
stdout, stderr, rc = self.current.execute_command(command)
File "/usr/lib/python2.6/site-packages/SSHLibrary/abstractclient.py", line 219, in execute_command
self.start_command(command)
File "/usr/lib/python2.6/site-packages/SSHLibrary/abstractclient.py", line 237, in start_command
self._started_commands.append(self._start_command(command))
File "/usr/lib/python2.6/site-packages/SSHLibrary/pythonclient.py", line 85, in _start_command
cmd.run_in(new_shell)
File "/usr/lib/python2.6/site-packages/SSHLibrary/abstractclient.py", line 1055, in run_in
self._execute()
File "/usr/lib/python2.6/site-packages/SSHLibrary/pythonclient.py", line 201, in _execute
self._shell.exec_command(self._command)
File "/usr/lib/python2.6/site-packages/paramiko/channel.py", line 60, in _check
return func(self, *args, **kwds)
File "/usr/lib/python2.6/site-packages/paramiko/channel.py", line 229, in exec_command
self._wait_for_event()
File "/usr/lib/python2.6/site-packages/paramiko/channel.py", line 1086, in _wait_for_event
raise e
I have seen this behaviour in several different steps during test execution.
Can someone help me figure out what the problem is or how it can be solved?