I've tried to trigger another DAG with some parameters using a TriggerDagRunOperator, but in the triggered DAG the dag_run object is always None.
In the TriggerDagRunOperator, the message param is added to the dag_run_obj's payload:
def conditionally_trigger(context, dag_run_obj):
    if context['params']['condition_param']:
        dag_run_obj.payload = {'message': context['params']['message']}
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj
trigger = TriggerDagRunOperator(
    task_id='test_trigger_dagrun',
    trigger_dag_id="example_trigger_target_dag",
    python_callable=conditionally_trigger,
    params={'condition_param': True, 'message': 'Hello World'},
    dag=dag,
)
I expected the triggered DAG to be able to read it via kwargs['dag_run'].conf['message'], but unfortunately it doesn't work.
def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag,
)
The dag_run object in kwargs is None
INFO - Executing <Task(PythonOperator): run_this> on 2019-01-18 16:10:18
INFO - Subtask: [2019-01-18 16:10:27,007] {models.py:1433} ERROR - 'NoneType' object has no attribute 'conf'
INFO - Subtask: Traceback (most recent call last):
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/models.py", line 1390, in run
INFO - Subtask: result = task_copy.execute(context=context)
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/operators/python_operator.py", line 80, in execute
INFO - Subtask: return_value = self.python_callable(*self.op_args, **self.op_kwargs)
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/example_dags/example_trigger_target_dag.py", line 52, in run_this_func
INFO - Subtask: print("Remotely received value of {} for key=message".format(kwargs['dag_run'].conf['message']))
INFO - Subtask: AttributeError: 'NoneType' object has no attribute 'conf'
I also printed out the kwargs, and indeed the 'dag_run' object is None.
These DAGs are sample code shipped with Airflow, so I'm not sure what happened.
Does anybody know the reason?
INFO - Subtask: kwargs: {u'next_execution_date': None, u'dag_run': None, u'tomorrow_ds_nodash': u'20190119', u'run_id': None, u'dag': <DAG: example_trigger_target_dag>, u'prev_execution_date': None, ...
BTW, if I trigger the DAG from the CLI, it works:
$ airflow trigger_dag 'example_trigger_target_dag' -r 'run_id' --conf '{"message":"test_cli"}'
Logs:
INFO - Subtask: kwargs: {u'next_execution_date': None, u'dag_run': <DagRun example_trigger_target_dag # 2019-01-18 ...
INFO - Subtask: Remotely received value of test_cli for key=message
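Not part of the original example, but a defensive sketch for the target task that guards against a missing DagRun (the fallback message is an assumption, not from the example DAG):

def run_this_func(ds, **kwargs):
    # dag_run is only populated when the task runs inside a real DagRun
    # (for example one created by TriggerDagRunOperator or
    # `airflow trigger_dag`); it can be None when the task is executed
    # with `airflow test`.
    dag_run = kwargs.get('dag_run')
    conf = dag_run.conf if dag_run is not None and dag_run.conf else {}
    print("Remotely received value of {} for key=message".format(
        conf.get('message', 'no message received')))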
Related
We are using the on_retry_callback parameter available on the Airflow operators to do some cleanup activities before the task is retried. If the on_retry_callback function throws an exception, the exception is not logged in the task instance's log. Without the exception details, it is difficult to debug issues in the on_retry_callback function. If this is the default behavior, is there a workaround to enable logging for these exceptions?
Note: We are using Airflow version 2.0.2.
Please let me know if there are any questions.
A sample DAG to illustrate this is given below.
from datetime import datetime

from airflow.operators.python import PythonOperator
from airflow.models.dag import DAG


def sample_function2():
    var = 1 / 0


def on_retry_callback_sample(context):
    print(f'on_retry_callback_started')
    v = 1 / 0
    print(f'on_retry_callback completed')


dag = DAG(
    'venkat-test-dag',
    description='This is a test dag',
    start_date=datetime(2023, 1, 10, 18, 0),
    schedule_interval='0 12 * * *',
    catchup=False
)

func2 = PythonOperator(task_id='function2',
                       python_callable=sample_function2,
                       dag=dag,
                       retries=2,
                       on_retry_callback=on_retry_callback_sample)

func2
The log file of this run on my local Airflow setup is given below. The last message we see in the log file is "on_retry_callback_started", but I expect a ZeroDivisionError after this line and finally the line "on_retry_callback completed". How can I achieve this?
14f0fed99882
*** Reading local file: /usr/local/airflow/logs/venkat-test-dag/function2/2023-01-13T13:22:03.178261+00:00/1.log
[2023-01-13 13:22:05,091] {{taskinstance.py:877}} INFO - Dependencies all met for <TaskInstance: venkat-test-dag.function2 2023-01-13T13:22:03.178261+00:00 [queued]>
[2023-01-13 13:22:05,128] {{taskinstance.py:877}} INFO - Dependencies all met for <TaskInstance: venkat-test-dag.function2 2023-01-13T13:22:03.178261+00:00 [queued]>
[2023-01-13 13:22:05,128] {{taskinstance.py:1068}} INFO -
--------------------------------------------------------------------------------
[2023-01-13 13:22:05,128] {{taskinstance.py:1069}} INFO - Starting attempt 1 of 3
[2023-01-13 13:22:05,128] {{taskinstance.py:1070}} INFO -
--------------------------------------------------------------------------------
[2023-01-13 13:22:05,143] {{taskinstance.py:1089}} INFO - Executing <Task(PythonOperator): function2> on 2023-01-13T13:22:03.178261+00:00
[2023-01-13 13:22:05,145] {{standard_task_runner.py:52}} INFO - Started process 6947 to run task
[2023-01-13 13:22:05,150] {{standard_task_runner.py:76}} INFO - Running: ['airflow', 'tasks', 'run', 'venkat-test-dag', 'function2', '2023-01-13T13:22:03.178261+00:00', '--job-id', '356', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/dp-etl-mixpanel_stg-24H/dags/venkat-test-dag.py', '--cfg-path', '/tmp/tmpny0mhh4j', '--error-file', '/tmp/tmpul506kro']
[2023-01-13 13:22:05,151] {{standard_task_runner.py:77}} INFO - Job 356: Subtask function2
[2023-01-13 13:22:05,244] {{logging_mixin.py:104}} INFO - Running <TaskInstance: venkat-test-dag.function2 2023-01-13T13:22:03.178261+00:00 [running]> on host 14f0fed99882
[2023-01-13 13:22:05,345] {{taskinstance.py:1283}} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=venkat-test-dag
AIRFLOW_CTX_TASK_ID=function2
AIRFLOW_CTX_EXECUTION_DATE=2023-01-13T13:22:03.178261+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2023-01-13T13:22:03.178261+00:00
[2023-01-13 13:22:05,346] {{taskinstance.py:1482}} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 128, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/dp-etl-mixpanel_stg-24H/dags/venkat-test-dag.py", line 7, in sample_function2
var = 1 / 0
ZeroDivisionError: division by zero
[2023-01-13 13:22:05,349] {{taskinstance.py:1532}} INFO - Marking task as UP_FOR_RETRY. dag_id=venkat-test-dag, task_id=function2, execution_date=20230113T132203, start_date=20230113T132205, end_date=20230113T132205
[2023-01-13 13:22:05,402] {{local_task_job.py:146}} INFO - Task exited with return code 1
[2023-01-13 13:22:05,459] {{logging_mixin.py:104}} INFO - on_retry_callback_started
Adding as an answer for visibility:
This issue is likely related to a fix which was merged in Airflow version 2.1.3:
https://github.com/apache/airflow/pull/17347
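Until you can upgrade past that version, one workaround sketch (a suggestion, not from the linked PR) is to catch and log exceptions inside the callback itself so the traceback reaches the task log:

import logging

log = logging.getLogger(__name__)


def on_retry_callback_sample(context):
    log.info('on_retry_callback_started')
    try:
        v = 1 / 0  # the real cleanup logic goes here
        log.info('on_retry_callback completed')
    except Exception:
        # log.exception records the full traceback instead of letting the
        # callback machinery swallow it silently.
        log.exception('on_retry_callback failed')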
In Airflow, I'm trying to run an operator (BigQueryOperator) in a loop. The DAG completes even before the queries finish.
What my DAG essentially does is:
Read a set of insert queries one by one.
Trigger each query using a BigQueryOperator.
When I try to write 2 records (with 2 insert statements), after the job I can only see 1 record.
dag
bteqQueries = ReadFile()  # Read GCP bucket file and get the list of SQL queries (as text) separated by new line

for currQuery in bteqQueries.split('\n'):
    # logging.info("currQuery : {}".format(currQuery))
    parameter = {
        'cur_query': currQuery
    }
    logging.info("START $$ : {}".format(parameter.get('cur_query')))

    gcs2BQ = BigQueryOperator(
        task_id='gcs2bq_insert',
        bql=parameter.get('cur_query'),
        write_disposition="WRITE_APPEND",
        bigquery_conn_id='bigquery_default',
        use_legacy_sql='False',
        dag=dag,
        task_concurrency=1)

    logging.info("END $$ : {}".format(parameter.get('cur_query')))

    gcs2BQ
I expect all the queries in the input file (in the GCS bucket) to be executed. I had a couple of insert queries and expected 2 records in the final BigQuery table, but I only see 1 record.
Below is the log:
2018-12-19 03:57:16,194] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,190] {gcs2BQ_bteq.py:59} INFO - START $$ : insert into `gproject.bucket.employee_test_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,205] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,201] {models.py:2190} WARNING - schedule_interval is used for <Task(BigQueryOperator): gcs2bq_insert>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2018-12-19 03:57:16,210] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,209] {gcs2BQ_bteq.py:68} INFO - END $$ : insert into `project.bucket.employee_test_stgemployee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,213] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,213] {gcs2BQ_bteq.py:59} INFO - START $$ : insert into `project.bucket.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (3,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,223] {base_task_runner.py:98} INFO - Subtask:
[2018-12-19 03:57:16,218] {models.py:2190} WARNING - schedule_interval is used for <Task(BigQueryOperator): gcs2bq_insert>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2018-12-19 03:57:16,230] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,230] {gcs2BQ_bteq.py:68} INFO - END $$ : insert into `dataset1.adp_etl_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (3,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,658] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,655] {bigquery_operator.py:90} INFO - Executing: insert into `dataset1.adp_etl_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,703] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,702] {gcp_api_base_hook.py:74} INFO - Getting connection using `gcloud auth` user, since no key file is defined for hook.
[2018-12-19 03:57:16,848] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,847] {discovery.py:267} INFO - URL being requested: GET https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest
[2018-12-19 03:57:16,849] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,849] {client.py:595} INFO - Attempting refresh to obtain initial access_token
[2018-12-19 03:57:17,012] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,011] {discovery.py:852} INFO - URL being requested: POST https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs?alt=json
[2018-12-19 03:57:17,214] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,214] {discovery.py:852} INFO - URL being requested: GET https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs/job_jqrRn4lK8IHqTArYAVj6cXRfLgDd?alt=json
[2018-12-19 03:57:17,304] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,303] {bigquery_hook.py:856} INFO - Waiting for job to complete : gcp-***Project***, job_jqrRn4lK8IHqTArYAVj6cXRfLgDd
[2018-12-19 03:57:22,311] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:22,310] {discovery.py:852} INFO - URL being requested: GET https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs/job_jqrRn4lK8IHqTArYAVj6cXRfLgDd?alt=json
Try with the following code:
gcs2BQ = []

for index, currQuery in enumerate(bteqQueries.split('\n')):
    logging.info("currQuery : {}".format(currQuery))
    parameter = {
        'cur_query': currQuery
    }
    logging.info("START $$ : {}".format(parameter.get('cur_query')))

    gcs2BQ.append(BigQueryOperator(
        task_id='gcs2bq_insert_{}'.format(index),
        bql=parameter.get('cur_query'),
        write_disposition="WRITE_APPEND",
        bigquery_conn_id='bigquery_default',
        use_legacy_sql='False',
        dag=dag,
        task_concurrency=1))

    logging.info("END $$ : {}".format(parameter.get('cur_query')))

    if index == 0:
        gcs2BQ[0]
    else:
        gcs2BQ[index - 1] >> gcs2BQ[index]
Basically, the task_id must be unique, and the code above also sets an explicit dependency between consecutive queries.
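An alternative sketch (assuming your Airflow version ships airflow.utils.helpers.chain, as 1.10.x does) that builds the list first and wires up the same linear dependency in one call:

from airflow.contrib.operators.bigquery_operator import BigQueryOperator  # Airflow 1.x import path
from airflow.utils.helpers import chain

# Build one uniquely named BigQueryOperator per query, then chain them so
# each insert runs only after the previous one has finished.
gcs2BQ = [
    BigQueryOperator(
        task_id='gcs2bq_insert_{}'.format(index),
        bql=currQuery,
        write_disposition="WRITE_APPEND",
        bigquery_conn_id='bigquery_default',
        dag=dag,
        task_concurrency=1)
    for index, currQuery in enumerate(bteqQueries.split('\n'))
]
chain(*gcs2BQ)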
We are trying to run a simple DAG with 2 tasks that communicate data via XCom.
DAG file:
from __future__ import print_function

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(
    'example_xcom',
    schedule_interval="@once",
    default_args=args)

value_1 = [1, 2, 3]


def push(**kwargs):
    # pushes an XCom without a specific target
    kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)


def puller(**kwargs):
    ti = kwargs['ti']
    v1 = ti.xcom_pull(key=None, task_ids='push')
    assert v1 == value_1
    v1 = ti.xcom_pull(key=None, task_ids=['push'])
    assert (v1) == (value_1)


push1 = PythonOperator(
    task_id='push', dag=dag, python_callable=push)

pull = BashOperator(
    task_id='also_run_this',
    bash_command='echo {{ ti.xcom_pull(task_ids="push_by_returning") }}',
    dag=dag)

pull.set_upstream(push1)
But while running the DAG in Airflow, we get the following exception:
[2018-09-27 16:55:33,431] {base_task_runner.py:98} INFO - Subtask: [2018-09-27 16:55:33,430] {models.py:189} INFO - Filling up the DagBag from /home/airflow/gcs/dags/xcom.py
[2018-09-27 16:55:33,694] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-09-27 16:55:33,694] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/bin/airflow", line 27, in <module>
[2018-09-27 16:55:33,696] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-09-27 16:55:33,697] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-09-27 16:55:33,697] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-09-27 16:55:33,698] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-09-27 16:55:33,699] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-09-27 16:55:33,699] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1492, in _run_raw_task
[2018-09-27 16:55:33,701] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-09-27 16:55:33,701] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 89, in execute
[2018-09-27 16:55:33,702] {base_task_runner.py:98} INFO - Subtask: return_value = self.execute_callable()
[2018-09-27 16:55:33,703] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 94, in execute_callable
[2018-09-27 16:55:33,703] {base_task_runner.py:98} INFO - Subtask: return self.python_callable(*self.op_args, **self.op_kwargs)
[2018-09-27 16:55:33,704] {base_task_runner.py:98} INFO - Subtask: File "/home/airflow/gcs/dags/xcom.py", line 22, in push
[2018-09-27 16:55:33,707] {base_task_runner.py:98} INFO - Subtask: kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)
[2018-09-27 16:55:33,708] {base_task_runner.py:98} INFO - Subtask: KeyError: 'ti'
We validated the DAG and there is no issue with it. Please help us fix this.
Add 'provide_context': True to the default args. This is needed for the task instance context (including ti) to be passed into **kwargs.
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True
}
provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
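Equivalently (a sketch, not from the original answer), provide_context can be set directly on the PythonOperator rather than via default_args:

push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
    provide_context=True,  # passes ti, dag_run, etc. into **kwargs
)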
I upgraded to v1.9 and I'm having a hard time getting the SSHOperator to work. It was working with v1.8.2.
Code
dag = DAG('transfer_ftp_s3', default_args=default_args, schedule_interval=None)

task = SSHOperator(
    ssh_conn_id='ssh_node',
    task_id="check_ftp_for_new_files",
    command="echo 'hello world'",
    do_xcom_push=True,
    dag=dag,
)
Error
[2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: Traceback (most recent call last):
[2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/bin/airflow", line 27, in <module>
[2018-02-19 06:48:02,692] {{base_task_runner.py:98}} INFO - Subtask: args.func(args)
[2018-02-19 06:48:02,693] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: pool=args.pool,
[2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: result = func(*args, **kwargs)
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/models.py", line 1496, in _run_raw_task
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: result = task_copy.execute(context=context)
[2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/contrib/operators/ssh_operator.py", line 146, in execute
[2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: raise AirflowException("SSH operator error: {0}".format(str(e)))
[2018-02-19 06:48:02,698] {{base_task_runner.py:98}} INFO - Subtask: airflow.exceptions.AirflowException: SSH operator error: 'bool' object has no attribute 'lower'
As described in AIRFLOW-2122, check your connection settings and make sure each value in the connection's Extra field is a string instead of a bool.
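For example, if the connection is defined in code rather than through the UI, the Extra field should be a JSON string whose values are quoted strings (a sketch; the host and the no_host_key_check key are assumptions, not taken from the question):

from airflow.models import Connection

ssh_node = Connection(
    conn_id='ssh_node',
    conn_type='ssh',
    host='my.remote.host',                  # hypothetical host
    extra='{"no_host_key_check": "true"}',  # "true" as a string, not the bool true
)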
I have an issue where the BashOperator is not logging all of the output from wget. It'll log only the first 1-5 lines of the output.
I have tried this with only wget as the bash command:
tester = BashOperator(
    task_id='testing',
    bash_command="wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip",
    dag=dag)
I've also tried this as part of a longer bash script that has other commands that follow wget. Airflow does wait for the script to complete before firing downstream tasks. Here's an example bash script:
#!/bin/bash
echo "Starting up..."
wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
echo "Download complete..."
unzip /tmp/httpcomponents-client-4.5.3-src.zip -o -d /tmp/test_airflow
echo "Archive unzipped..."
Last few lines of a log file:
[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: Starting attempt 1 of 1
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask:
[2017-04-13 18:33:35,068] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:35,068] {models.py:1342} INFO - Executing <Task(BashOperator): testing> on 2017-04-13 18:33:08
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,569] {bash_operator.py:71} INFO - tmp dir root location:
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: /tmp
[2017-04-13 18:33:37,571] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,571] {bash_operator.py:81} INFO - Temporary script location :/tmp/airflowtmpqZhPjB//tmp/airflowtmpqZhPjB/testingCkJgDE
[2017-04-13 18:14:54,943] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,942] {bash_operator.py:82} INFO - Running command: /var/www/upstream/xtractor/scripts/Temp_test.sh
[2017-04-13 18:14:54,951] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,950] {bash_operator.py:91} INFO - Output:
[2017-04-13 18:14:54,955] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,954] {bash_operator.py:96} INFO - Starting up...
[2017-04-13 18:14:54,958] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,957] {bash_operator.py:96} INFO - --2017-04-13 18:14:54-- http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
[2017-04-13 18:14:55,106] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,105] {bash_operator.py:96} INFO - Resolving apache.cs.utah.edu (apache.cs.utah.edu)... 155.98.64.87
[2017-04-13 18:14:55,186] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,186] {bash_operator.py:96} INFO - Connecting to apache.cs.utah.edu (apache.cs.utah.edu)|155.98.64.87|:80... connected.
[2017-04-13 18:14:55,284] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - HTTP request sent, awaiting response... 200 OK
[2017-04-13 18:14:55,285] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - Length: 1662639 (1.6M) [application/zip]
[2017-04-13 18:15:01,485] {jobs.py:2083} INFO - Task exited with return code 0
Edit: More testing suggests that it's a problem logging the output of wget.
It's because the default operator only prints the last line. Replace the code inside airflow/operators/bash_operator.py, wherever your Airflow is installed, with the following. Usually you need to find where your Python is installed and then go to site-packages.
from builtins import bytes
import os
import signal
import logging
from subprocess import Popen, STDOUT, PIPE
from tempfile import gettempdir, NamedTemporaryFile

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.file import TemporaryDirectory


class BashOperator(BaseOperator):
    """
    Execute a Bash script, command or set of commands.

    :param bash_command: The command, set of commands or reference to a
        bash script (must be '.sh') to be executed.
    :type bash_command: string
    :param xcom_push: If xcom_push is True, the last line written to stdout
        will also be pushed to an XCom when the bash command completes.
    :type xcom_push: bool
    :param env: If env is not None, it must be a mapping that defines the
        environment variables for the new process; these are used instead
        of inheriting the current process environment, which is the default
        behavior. (templated)
    :type env: dict
    :type output_encoding: output encoding of bash command
    """
    template_fields = ('bash_command', 'env')
    template_ext = ('.sh', '.bash',)
    ui_color = '#f0ede4'

    @apply_defaults
    def __init__(
            self,
            bash_command,
            xcom_push=False,
            env=None,
            output_encoding='utf-8',
            *args, **kwargs):
        super(BashOperator, self).__init__(*args, **kwargs)
        self.bash_command = bash_command
        self.env = env
        self.xcom_push_flag = xcom_push
        self.output_encoding = output_encoding

    def execute(self, context):
        """
        Execute the bash command in a temporary directory
        which will be cleaned afterwards
        """
        bash_command = self.bash_command
        logging.info("tmp dir root location: \n" + gettempdir())
        line_buffer = []
        with TemporaryDirectory(prefix='airflowtmp') as tmp_dir:
            with NamedTemporaryFile(dir=tmp_dir, prefix=self.task_id) as f:
                f.write(bytes(bash_command, 'utf_8'))
                f.flush()
                fname = f.name
                script_location = tmp_dir + "/" + fname
                logging.info("Temporary script "
                             "location :{0}".format(script_location))
                logging.info("Running command: " + bash_command)
                sp = Popen(
                    ['bash', fname],
                    stdout=PIPE, stderr=STDOUT,
                    cwd=tmp_dir, env=self.env,
                    preexec_fn=os.setsid)

                self.sp = sp

                logging.info("Output:")
                line = ''
                for line in iter(sp.stdout.readline, b''):
                    line = line.decode(self.output_encoding).strip()
                    line_buffer.append(line)
                    logging.info(line)
                sp.wait()
                logging.info("Command exited with "
                             "return code {0}".format(sp.returncode))

                if sp.returncode:
                    raise AirflowException("Bash command failed")

        logging.info("\n".join(line_buffer))
        if self.xcom_push_flag:
            return "\n".join(line_buffer)

    def on_kill(self):
        logging.info('Sending SIGTERM signal to bash process group')
        os.killpg(os.getpgid(self.sp.pid), signal.SIGTERM)
This isn't a complete answer, but it's a big step forward. The problem seems to be an issue with Python's logging function and the output wget produces. Turns out the airflow scheduler was throwing an error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position....
I modified bash_operator.py in the airflow code base so that the bash output is encoded (on line 95):
logging.info(line.encode('utf-8'))
An error is still happening but at least it is appearing in the log file along with the output of the rest of the bash script. The error that appears in the log file now is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position...
Even though there is still a Python error happening, output is getting logged, so I'm satisfied for now. If someone has an idea of how to resolve this issue in a better manner, I'm open to ideas.
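One further refinement worth trying (a sketch, assuming line is a unicode string at that point, as in the loop above): force the logged line down to ASCII with replacement characters, which sidesteps both the encode and decode errors at the cost of mangling non-ASCII characters such as the curly quotes wget emits.

# Characters such as u'\u2018' become '?' instead of raising.
logging.info(line.encode('ascii', errors='replace'))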