I am running a pipeline in Airflow that contains multiple BashOperators. Each operator has on_failure_callback and on_success_callback attributes that call a function which sends an email with the status of the task (success/fail) and uploads the generated log file from a local directory to HDFS. The following code snippets show a sample of the operator I am using and the callable function.
Bash Operator:
op = BashOperator(
    task_id='test_op',
    bash_command='python3 run.py',
    on_failure_callback=fail_email,
    on_success_callback=success_email,
    retries=3,
    dag=dag)
success_email:
def success_email(contextDict, **kwargs):
    """Send custom email alerts."""
    # email title.
    title = "Task {} SUCCEEDED. Execution date: {}".format(contextDict['task'].task_id, contextDict['execution_date'])
    # email contents
    body = """
    <br>
    The correspondent log file:
    <br>
    {}
    """.format(hdfs_log)
    print("Uploading log to hdfs")
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_log_folder])
    subprocess.check_call(["hdfs", "dfs", "-put", local_log, hdfs_log])
    send_email('email@domain.com', title, html_content=body)
The success_callback always fails when calling the hdfs commands and gives the following error:
[2018-12-28 09:13:29,727] INFO - Uploading log to hdfs
[2018-12-28 09:13:30,344] INFO - [2018-12-28 09:13:30,342] WARNING - State of this instance has been externally set to success. Taking the poison pill.
[2018-12-28 09:13:30,381] INFO - Sending Signals.SIGTERM to GPID 11515
[2018-12-28 09:13:30,382] ERROR - Received SIGTERM. Terminating subprocesses.
[2018-12-28 09:13:30,382] INFO - Sending SIGTERM signal to bash process group
[2018-12-28 09:13:30,390] ERROR - Failed when executing success callback
[2018-12-28 09:13:30,390] ERROR - [Errno 3] No such process
Traceback (most recent call last):
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/models.py", line 1687, in _run_raw_task
task.on_success_callback(context)
File "/usr/local/airflow/dags/Appl_FUMA.py", line 139, in success_email
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_log_folder])
File "/usr/lib64/python3.6/subprocess.py", line 286, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib64/python3.6/subprocess.py", line 269, in call
return p.wait(timeout=timeout)
File "/usr/lib64/python3.6/subprocess.py", line 1457, in wait
(pid, sts) = self._try_wait(0)
File "/usr/lib64/python3.6/subprocess.py", line 1404, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/models.py", line 1611, in signal_handler
task_copy.on_kill()
File "/opt/hadoop/airflow/python/lib/python3.6/site-packages/airflow/operators/bash_operator.py", line 125, in on_kill
os.killpg(os.getpgid(self.sp.pid), signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=11515 (terminated)) (11515) terminated with exit code 0
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=20649 (terminated)) (20649) terminated with exit code None
[2018-12-28 09:13:30,514] INFO - Process psutil.Process(pid=11530 (terminated)) (11530) terminated with exit code None
[2018-12-28 09:13:30,515] INFO - [2018-12-28 09:13:30,515] INFO - Task exited with return code 0
However, it manages to send the email (sometimes) when I comment out the two subprocess lines.
Any idea how to fix this issue?
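A possible workaround, sketched below with hypothetical task names and not verified, is to keep the callbacks limited to sending the email and move the HDFS upload into a dedicated downstream task with trigger_rule='all_done', so the subprocess calls run in a normal task context rather than inside the callback:
import subprocess
from airflow.operators.python_operator import PythonOperator

def upload_log_to_hdfs(**context):
    # Same upload logic as in the callback, but running as a regular task.
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_log_folder])
    subprocess.check_call(["hdfs", "dfs", "-put", local_log, hdfs_log])

upload_log = PythonOperator(
    task_id='upload_log_to_hdfs',  # hypothetical task name
    python_callable=upload_log_to_hdfs,
    provide_context=True,
    trigger_rule='all_done',       # run whether 'test_op' succeeded or failed
    dag=dag)

op >> upload_log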
Related
I'm running Airflow with Docker Swarm on 5 servers. After about 2 months of use, some DAGs started failing with errors like the one below. The errors occur in DAGs that use a custom Hive operator (similar to the inner function), and no errors occurred during the first 2 months (nothing changed in the DAGs...).
Also, when I retry the DAG, it sometimes succeeds and sometimes fails.
The really weird thing about this issue is that the Hive job itself did not fail: after the task was marked as failed in the Airflow webserver (SIGTERM), the query still completed 1~10 minutes later.
As a result, the flow looks like this:
Task start -> 5~10 mins -> error (SIGTERM, Airflow) -> 1~10 mins -> Hive job success (Hadoop log)
[2023-01-09 08:06:07,583] {local_task_job.py:208} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2023-01-09 08:06:07,588] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 135213
[2023-01-09 08:06:07,588] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-01-09 08:13:42,510] {taskinstance.py:1463} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/dags/common/operator/hive_q_operator.py", line 81, in execute
cur.execute(statement) # hive query custom operator
File "/home/airflow/.local/lib/python3.8/site-packages/pyhive/hive.py", line 454, in execute
response = self._connection.client.ExecuteStatement(req)
File "/home/airflow/.local/lib/python3.8/site-packages/TCLIService/TCLIService.py", line 280, in ExecuteStatement
return self.recv_ExecuteStatement()
File "/home/airflow/.local/lib/python3.8/site-packages/TCLIService/TCLIService.py", line 292, in recv_ExecuteStatement
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 134, in readMessageBegin
sz = self.readI32()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 217, in readI32
buff = self.trans.readAll(4)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 62, in readAll
chunk = self.read(sz - have)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 173, in read
self._read_frame()
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 177, in _read_frame
header = self._trans_read_all(4)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift_sasl/__init__.py", line 210, in _trans_read_all
return read_all(sz)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 62, in readAll
chunk = self.read(sz - have)
File "/home/airflow/.local/lib/python3.8/site-packages/thrift/transport/TSocket.py", line 150, in read
buff = self.handle.recv(sz)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1238, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
I already restarted the Airflow server, but nothing changed.
Here is the failed task's log (Flower log); is there any helpful guide for me?
Thanks :)
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 88, in execute_command
_execute_in_fork(command_to_exec)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 99, in _execute_in_fork
raise AirflowException('Celery command failed on host: ' + get_hostname())
airflow.exceptions.AirflowException: Celery command failed on host: 8be4caa25d17
Say I have the following DAG (stuff omitted for clarity)
# dag.py
from airflow.operators.python import PythonOperator

def main():
    print("Task 1")
    # some code
    print("Task 2")
    # some more code
    print("Done")
    return 0

t1 = PythonOperator(python_callable=main)
t1
Say the program fails at #some more code due to, e.g., RAM issues; I just get an error in my log, e.g.
[2021-05-25 12:49:54,211] {process_utils.py:137} INFO - Output:
[2021-05-25 12:52:44,605] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 493, in execute
super().execute(context=serializable_context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python.py", line 531, in execute_callable
string_args_filename,
File "/usr/local/lib/python3.6/site-packages/airflow/utils/process_utils.py", line 145, in execute_in_subprocess
raise subprocess.CalledProcessError(exit_code, cmd)
subprocess.CalledProcessError: Command '['/tmp/venv2wbjnabi/bin/python', '/tmp/venv2wbjnabi/script.py', '/tmp/venv2wbjnabi/script.in', '/tmp/venv2wbjnabi/script.out', '/tmp/venv2wbjnabi/string_args.txt']' died with <Signals.SIGKILL: 9>.
[2021-05-25 13:00:55,733] {taskinstance.py:1532} INFO - Marking task as FAILED. dag_id=test_dag, task_id=clean_data, execution_date=20210525T105621, start_date=20210525T105732, end_date=20210525T110055
[2021-05-25 13:00:56,555] {local_task_job.py:146} INFO - Task exited with return code 1
but none of the print statements are printed, so I don't know where the program failed (I only know now because of debugging).
I assume, because of that, that Airflow doesn't flush before the task is marked as "success". Is there a way to make Airflow flush/print at runtime?
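One thing worth trying, sketched below under the assumption that the output is being buffered on the Python side, is to flush stdout explicitly so each line is written out as soon as it is printed:
import sys

def main():
    print("Task 1", flush=True)  # flush=True writes the line out immediately
    # some code
    print("Task 2", flush=True)
    # some more code
    print("Done", flush=True)
    sys.stdout.flush()           # extra safety before returning
    return 0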
I have a couple of DAGs running which execute Python scripts that basically copy data from A to B. One of the DAGs throws an error, but the data still gets copied, so the error apparently has no influence on the execution of the Python script.
The only "special" thing about this DAG is that the Python script builds up a connection to a Postgres database using psycopg2~=2.8.5, but I'm not sure whether that is the root cause.
I also checked the permissions for the user, which seem to be fine, at least in the dags folder.
Is there any specific timeout value I have to adjust in the config?
[2021-05-19 12:53:42,036] {taskinstance.py:1145} ERROR - Bash command failed 255
Traceback (most recent call last):
File "/hereisthepath/venv/lib64/python3.6/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/hereisthepath/venv/lib64/python3.6/site-packages/airflow/operators/bash_operator.py", line 134, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
Update: This is the part of the operator that fails. I copied the entire function; the error is raised at line 134 (raise AirflowException("Bash command failed")).
def execute(self, context):
    """
    Execute the bash command in a temporary directory
    which will be cleaned afterwards
    """
    self.log.info("Tmp dir root location: \n %s", gettempdir())

    # Prepare env for child process.
    env = self.env
    if env is None:
        env = os.environ.copy()
    airflow_context_vars = context_to_airflow_vars(context, in_env_var_format=True)
    self.log.debug('Exporting the following env vars:\n%s',
                   '\n'.join(["{}={}".format(k, v)
                              for k, v in airflow_context_vars.items()]))
    env.update(airflow_context_vars)

    self.lineage_data = self.bash_command

    with TemporaryDirectory(prefix='airflowtmp') as tmp_dir:
        with NamedTemporaryFile(dir=tmp_dir, prefix=self.task_id) as f:
            f.write(bytes(self.bash_command, 'utf_8'))
            f.flush()
            fname = f.name
            script_location = os.path.abspath(fname)
            self.log.info(
                "Temporary script location: %s",
                script_location
            )

            def pre_exec():
                # Restore default signal disposition and invoke setsid
                for sig in ('SIGPIPE', 'SIGXFZ', 'SIGXFSZ'):
                    if hasattr(signal, sig):
                        signal.signal(getattr(signal, sig), signal.SIG_DFL)
                os.setsid()

            self.log.info("Running command: %s", self.bash_command)
            self.sub_process = Popen(
                ['bash', fname],
                stdout=PIPE, stderr=STDOUT,
                cwd=tmp_dir, env=env,
                preexec_fn=pre_exec)

            self.log.info("Output:")
            line = ''
            for line in iter(self.sub_process.stdout.readline, b''):
                line = line.decode(self.output_encoding).rstrip()
                self.log.info(line)
            self.sub_process.wait()
            self.log.info(
                "Command exited with return code %s",
                self.sub_process.returncode
            )
            if self.sub_process.returncode:
                raise AirflowException("Bash command failed")

    if self.xcom_push_flag:
        return line
Update 2: It really does seem that this behavior is related to psycopg2: I have now tested all other possible error sources, and the error only occurs when I test against the Postgres data source using the psycopg2 package. Meanwhile I also upgraded to the most recent version of psycopg2 (2.8.6), but without success.
Maybe this helps for further investigation.
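Since the log only shows the generic "Bash command failed 255", one way to narrow it down (a sketch; the script path and task_id below are hypothetical) is to have the bash_command report the script's exit status explicitly before propagating it, so the task log shows exactly what the Python script returned:
copy_task = BashOperator(
    task_id='copy_a_to_b_traced',            # hypothetical task_id
    bash_command=(
        'python3 /path/to/copy_script.py; '  # hypothetical script path
        'rc=$?; '
        'echo "copy_script.py exited with code $rc"; '
        'exit $rc'                           # propagate the real exit code to Airflow
    ),
    dag=dag)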
I have an Airflow task that runs youtube-dl and works fine. I'm running this with a BashOperator:
YOUTUBE_DL_CMD = (
    '/usr/local/bin/youtube-dl -w -i '
    '--max-downloads {{ params.max_downloads }} '
    '--write-info-json "{{ params.playlist_url }}" '
    '-o "{{ params.output }}"'
)
However, when it's done it triggers an error due to a "bad exit code", and hence marks my DAG run as failed.
This is caused by the way the tool exits after being called. Here is some output about the error that I pulled from the Airflow log. Note that --max-download limit reached, aborting is what youtube-dl outputs when it exits successfully.
{{bash_operator.py:128}} INFO - --max-download limit reached, aborting.
{{bash_operator.py:132}} INFO - Command exited with return code 101
{{taskinstance.py:1047}} ERROR - Bash command failed
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 922, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 136, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
Is there a way I can prevent this error from being raised in Airflow?
Exit code 101 looks like a genuine exit status - https://github.com/ytdl-org/youtube-dl/blob/826dcff99cd0a44ec5fa94f0e0201f5115d097ef/youtube_dl/__init__.py#L467
You can hide the youtube-dl exit code from Airflow:
YOUTUBE_DL_CMD = (
    '/usr/local/bin/youtube-dl -w -i '
    '--max-downloads {{ params.max_downloads }} '
    '--write-info-json "{{ params.playlist_url }}" '
    '-o "{{ params.output }}" || true'
)
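If you'd rather not mask every possible failure, a variant of the same idea (a sketch, reusing the same templated parameters) is to treat only the expected exit code 101 as success and keep failing on anything else:
YOUTUBE_DL_CMD = (
    '/usr/local/bin/youtube-dl -w -i '
    '--max-downloads {{ params.max_downloads }} '
    '--write-info-json "{{ params.playlist_url }}" '
    '-o "{{ params.output }}"; '
    # Treat 101 (max-downloads limit reached) as success, fail on anything else.
    'rc=$?; if [ "$rc" -eq 101 ]; then exit 0; else exit $rc; fi'
)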
This is my first attempt at using pexpect, and the Python 3 script using pexpect is pretty simple; yet it fails.
#!/usr/bin/env python3
import sys
import pexpect

SSH_NEWKEY = r'Are you sure you want to continue connecting \(yes/no\)\?'
child = pexpect.spawn("ssh -i /user/aws/key.pem ec2-user@xxx.xxx.xxx.xxx date")
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
if i == 1:
    child.sendline('yes')
print(child.before)
SSH_NEWKEY is the only response I'm expecting, but the example showed a list containing pexpect.TIMEOUT, so I used that as well.
$ ./test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 144, in read_nonblocking
s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 97, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/pty_spawn.py", line 455, in read_nonblocking
return super(spawn, self).read_nonblocking(size)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 149, in read_nonblocking
raise EOF('End Of File (EOF). Exception style platform.')
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./min.py", line 15, in <module>
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 315, in expect
timeout, searchwindowsize, async)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 339, in expect_list
return exp.expect_loop(timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 102, in expect_loop
return self.eof(e)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 49, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f70ea4fbcf8>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', '-i', '/user/aws/key.pem', 'ec2-user#xxx.xxx.xxx.xxx', 'date']
searcher: None
buffer (last 100 chars): b''
before (last 100 chars): b'Fri May 6 13:50:18 EDT 2016\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: 0
flag_eof: True
pid: 31293
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
What am I missing?
CentOS 6.4
python 3.4.3
An EOF error is being raised during your expect call. This means that the response received does not match SSH_NEWKEY and reaches end of file within the timeout period. To catch this exception, you should change your expect line to read:
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY, pexpect.EOF ] )
You can then make your if more robust:
if i == 1:
    child.sendline('yes')
elif i == 0:
    print("Timeout")
elif i == 2:
    print("EOF")
print(child.before)
This doesn't address the underlying reason why you are not receiving a response containing the expected string - it's hard to know without looking at more code, but it's likely because the expected response is slightly wrong. If you manually run the SSH command, you should be able to see the response you can expect, and enter that response into your code.
You can also print child.before after your expect call, or print child.read() instead of your expect call, to see what is being sent back as a response.
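Putting this together, a minimal sketch of the corrected script (same key path and placeholder host as in the question, not tested against a live host):
#!/usr/bin/env python3
import pexpect

SSH_NEWKEY = r'Are you sure you want to continue connecting \(yes/no\)\?'

child = pexpect.spawn("ssh -i /user/aws/key.pem ec2-user@xxx.xxx.xxx.xxx date")

# Expect either a timeout, the new-host-key prompt, or EOF (command already finished).
i = child.expect([pexpect.TIMEOUT, SSH_NEWKEY, pexpect.EOF])
if i == 0:
    print("Timeout")
elif i == 1:
    child.sendline('yes')          # accept the new host key
    child.expect(pexpect.EOF)      # then wait for the remote command to finish
elif i == 2:
    print("EOF")                   # host key already known; ssh ran the command and exited

print(child.before)                # whatever output arrived before the match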