I have the following code:
file_name = gcs_export_uri_template + '/' + TABLE_PREFIX + '_' + TABLE_NAME + '{}.json' # the {} is required by the operator; if the file is big it breaks it into more files, as 1.json, 2.json, etc.
import_orders_op = MySqlToGoogleCloudStorageOperator(
task_id='import_orders',
mysql_conn_id='sqlcon',
google_cloud_storage_conn_id='gcpcon',
provide_context=True,
sql=""" SELECT * FROM {{ params.table_name }} WHERE orders_id > {{ params.last_imported_id }} AND orders_id < {{ ti.xcom_pull('get_max_order_id') }} limit 10 """,
params={'last_imported_id': LAST_IMPORTED_ORDER_ID, 'table_name' : TABLE_NAME},
bucket=GCS_BUCKET_ID,
filename=file_name,
dag=dag)
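For illustration (the values here are made up), the filename template resolves like this:
# hypothetical values, just to show what the {} placeholder is for
gcs_export_uri_template = 'exports/orders'
TABLE_PREFIX = 'shop'
TABLE_NAME = 'orders'
file_name = gcs_export_uri_template + '/' + TABLE_PREFIX + '_' + TABLE_NAME + '{}.json'
print(file_name.format(0))  # exports/orders/shop_orders0.json
print(file_name.format(1))  # exports/orders/shop_orders1.json, and so on for each chunk the operator writes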
This works well. However, notice that the query has limit 10.
When I remove it, as in:
sql=""" SELECT * FROM {{ params.table_name }} WHERE orders_id > {{ params.last_imported_id }} AND orders_id < {{ ti.xcom_pull('get_max_order_id') }} """,
It fails with:
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/bin/airflow", line 27, in <module>
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 392, in run
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1493, in _run_raw_task
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/mysql_to_gcs.py", line 89, in execute
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: files_to_upload = self._write_local_data_files(cursor)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/mysql_to_gcs.py", line 134, in _write_local_data_files
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: json.dump(row_dict, tmp_file_handle)
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: for chunk in iterable:
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: for chunk in _iterencode_dict(o, _current_indent_level):
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask: yield _encoder(value)
[2018-10-08 09:09:38,833] {base_task_runner.py:98} INFO - Subtask: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 5: invalid start byte
I can only assume that the reason is the file_name with the {}.json: maybe when there are too many records and the operator needs to split the file, it can't?
I'm running Airflow 1.9.0
What is the problem here?
Your limit 10 just happens to be returning a clean 10 rows of unambiguous ASCII encoding. However, your larger select is returning something that does not decode as UTF-8. I had this when my MySQL connection had no extras set.
If you have no extras at all, edit your connection to have {"charset": "utf8"} in the extras field. If you have extras, just add that key-value pair into the collection.
This should establish an encoding for the MySQL client the hook uses to retrieve records, and things should start decoding correctly. Whether or not they'll write to GCS is an exercise left to you.
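If you want to see the effect outside Airflow first, here is a minimal sketch (using the MySQLdb client that the MySQL hook wraps; the connection details and table are placeholders):
import MySQLdb

# charset='utf8' is what the {"charset": "utf8"} extra ends up setting on the client
conn = MySQLdb.connect(host='mysql-host', user='user', passwd='password', db='shop',
                       charset='utf8', use_unicode=True)
cur = conn.cursor()
cur.execute("SELECT * FROM orders WHERE orders_id > 100 AND orders_id < 200")
for row in cur.fetchall():
    print(row)  # values now come back as unicode instead of raw bytes that json.dump cannot decode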
I've tried to trigger another DAG with some parameters via a TriggerDagRunOperator, but in the triggered DAG the dag_run object is always None.
In the TriggerDagRunOperator, the message param is added into dag_run_obj's payload.
def conditionally_trigger(context, dag_run_obj):
if context['params']['condition_param']:
dag_run_obj.payload = {'message': context['params']['message']}
pp.pprint(dag_run_obj.payload)
return dag_run_obj
trigger = TriggerDagRunOperator(
task_id='test_trigger_dagrun',
trigger_dag_id="example_trigger_target_dag",
python_callable=conditionally_trigger,
params={'condition_param': True, 'message': 'Hello World'},
dag=dag,
)
I expected the triggered DAG to be able to get it using kwargs['dag_run'].conf['message'], but unfortunately it doesn't work.
def run_this_func(ds, **kwargs):
print("Remotely received value of {} for key=message".
format(kwargs['dag_run'].conf['message']))
run_this = PythonOperator(
task_id='run_this',
provide_context=True,
python_callable=run_this_func,
dag=dag,
)
The dag_run object in kwargs is None
INFO - Executing <Task(PythonOperator): run_this> on 2019-01-18 16:10:18
INFO - Subtask: [2019-01-18 16:10:27,007] {models.py:1433} ERROR - 'NoneType' object has no attribute 'conf'
INFO - Subtask: Traceback (most recent call last):
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/models.py", line 1390, in run
INFO - Subtask: result = task_copy.execute(context=context)
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/operators/python_operator.py", line 80, in execute
INFO - Subtask: return_value = self.python_callable(*self.op_args, **self.op_kwargs)
INFO - Subtask: File "/Library/Python/2.7/site-packages/airflow/example_dags/example_trigger_target_dag.py", line 52, in run_this_func
INFO - Subtask: print("Remotely received value of {} for key=message".format(kwargs['dag_run'].conf['message']))
INFO - Subtask: AttributeError: 'NoneType' object has no attribute 'conf'
I also printed out the kwargs and indeed the 'dag_run' object is None.
The DAGs are sample code shipped with Airflow, so I'm not sure what happened.
Does anybody know the reason?
INFO - Subtask: kwargs: {u'next_execution_date': None, u'dag_run': None, u'tomorrow_ds_nodash': u'20190119', u'run_id': None, u'dag': <DAG: example_trigger_target_dag>, u'prev_execution_date': None, ...
BTW, if I trigger the DAG from the CLI, it works:
$ airflow trigger_dag 'example_trigger_target_dag' -r 'run_id' --conf '{"message":"test_cli"}'
Logs:
INFO - Subtask: kwargs: {u'next_execution_date': None, u'dag_run': <DagRun example_trigger_target_dag # 2019-01-18 ...
INFO - Subtask: Remotely received value of test_cli for key=message
In Airflow, I'm trying to loop an operator (BigQueryOperator). The DAG completes even before the query finishes.
What my DAG essentially does is:
Read a set of insert queries one by one.
Trigger each query using a BigQueryOperator.
When I try to write 2 records (with 2 insert statements), after the job I can only see 1 record.
dag
bteqQueries = ReadFile() # read the GCS bucket file and get the list of SQL queries (as text) separated by newlines
for currQuery in bteqQueries.split('\n'):
#logging.info("currQuery : {}".format(currQuery))
parameter = {
'cur_query': currQuery
}
logging.info("START $$ : {}".format(parameter.get('cur_query')))
gcs2BQ = BigQueryOperator(
task_id='gcs2bq_insert',
bql=parameter.get('cur_query'),
write_disposition="WRITE_APPEND",
bigquery_conn_id='bigquery_default',
use_legacy_sql='False',
dag=dag,
task_concurrency=1)
logging.info("END $$ : {}".format(parameter.get('cur_query')))
gcs2BQ
I expect all the queries in the input file (in the GCS bucket) to be executed. I had a couple of insert queries and expected 2 records in the final BigQuery table, but I only see 1 record.
Below is the log:
2018-12-19 03:57:16,194] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,190] {gcs2BQ_bteq.py:59} INFO - START $$ : insert into `gproject.bucket.employee_test_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,205] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,201] {models.py:2190} WARNING - schedule_interval is used for <Task(BigQueryOperator): gcs2bq_insert>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2018-12-19 03:57:16,210] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,209] {gcs2BQ_bteq.py:68} INFO - END $$ : insert into `project.bucket.employee_test_stgemployee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,213] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,213] {gcs2BQ_bteq.py:59} INFO - START $$ : insert into `project.bucket.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (3,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,223] {base_task_runner.py:98} INFO - Subtask:
[2018-12-19 03:57:16,218] {models.py:2190} WARNING - schedule_interval is used for <Task(BigQueryOperator): gcs2bq_insert>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2018-12-19 03:57:16,230] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,230] {gcs2BQ_bteq.py:68} INFO - END $$ : insert into `dataset1.adp_etl_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (3,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,658] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,655] {bigquery_operator.py:90} INFO - Executing: insert into `dataset1.adp_etl_stg.employee_test_stg` (emp_id,emp_name,edh_end_dttm) values (2,"srikanth","2099-01-01") ;
[2018-12-19 03:57:16,703] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,702] {gcp_api_base_hook.py:74} INFO - Getting connection using `gcloud auth` user, since no key file is defined for hook.
[2018-12-19 03:57:16,848] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,847] {discovery.py:267} INFO - URL being requested: GET https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest
[2018-12-19 03:57:16,849] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:16,849] {client.py:595} INFO - Attempting refresh to obtain initial access_token
[2018-12-19 03:57:17,012] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,011] {discovery.py:852} INFO - URL being requested: POST https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs?alt=json
[2018-12-19 03:57:17,214] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,214] {discovery.py:852} INFO - URL being requested: GET https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs/job_jqrRn4lK8IHqTArYAVj6cXRfLgDd?alt=json
[2018-12-19 03:57:17,304] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:17,303] {bigquery_hook.py:856} INFO - Waiting for job to complete : gcp-***Project***, job_jqrRn4lK8IHqTArYAVj6cXRfLgDd
[2018-12-19 03:57:22,311] {base_task_runner.py:98} INFO - Subtask: [2018-12-19 03:57:22,310] {discovery.py:852} INFO - URL being requested: GET https://www.googleapis.com/bigquery/v2/projects/gcp-***Project***/jobs/job_jqrRn4lK8IHqTArYAVj6cXRfLgDd?alt=json
Try with the following code:
gcs2BQ = []
for index, currQuery in enumerate(bteqQueries.split('\n')):
logging.info("currQuery : {}".format(currQuery))
parameter = {
'cur_query': currQuery
}
logging.info("START $$ : {}".format(parameter.get('cur_query')))
gcs2BQ.append(BigQueryOperator(
task_id='gcs2bq_insert_{}'.format(index),
bql=parameter.get('cur_query'),
write_disposition="WRITE_APPEND",
bigquery_conn_id='bigquery_default',
use_legacy_sql=False,
dag=dag,
task_concurrency=1))
logging.info("END $$ : {}".format(parameter.get('cur_query')))
if index == 0:
gcs2BQ[0]
else:
gcs2BQ[index - 1] >> gcs2BQ[index]
Basically, the task_id should be unique, and you can specify an explicit dependency between the queries using the code above.
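A slightly more compact sketch of the same idea (reusing the question's bteqQueries and dag, and skipping blank lines) keeps a reference to the previous task and chains as it goes:
prev_task = None
for index, curr_query in enumerate(q for q in bteqQueries.split('\n') if q.strip()):
    task = BigQueryOperator(
        task_id='gcs2bq_insert_{}'.format(index),  # unique task_id per query
        bql=curr_query,
        write_disposition='WRITE_APPEND',
        bigquery_conn_id='bigquery_default',
        use_legacy_sql=False,
        dag=dag,
        task_concurrency=1)
    if prev_task is not None:
        prev_task >> task  # run the inserts strictly one after another
    prev_task = task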
We are trying to run a simple DAG with 2 tasks which will communicate data via xcom.
DAG file:
from __future__ import print_function
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
dag = DAG(
'example_xcom',
schedule_interval="#once",
default_args=args)
value_1 = [1, 2, 3]
def push(**kwargs):
# pushes an XCom without a specific target
kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)
def puller(**kwargs):
ti = kwargs['ti']
v1 = ti.xcom_pull(key=None, task_ids='push')
assert v1 == value_1
v1 = ti.xcom_pull(key=None, task_ids=['push'])
assert (v1) == (value_1)
push1 = PythonOperator(
task_id='push', dag=dag, python_callable=push)
pull = BashOperator(
task_id='also_run_this',
bash_command='echo {{ ti.xcom_pull(task_ids="push_by_returning") }}',
dag=dag)
pull.set_upstream(push1)
But while running the DAG in airflow we are getting the following exception.
[2018-09-27 16:55:33,431] {base_task_runner.py:98} INFO - Subtask: [2018-09-27 16:55:33,430] {models.py:189} INFO - Filling up the DagBag from /home/airflow/gcs/dags/xcom.py
[2018-09-27 16:55:33,694] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-09-27 16:55:33,694] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/bin/airflow", line 27, in <module>
[2018-09-27 16:55:33,696] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-09-27 16:55:33,697] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-09-27 16:55:33,697] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-09-27 16:55:33,698] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-09-27 16:55:33,699] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-09-27 16:55:33,699] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1492, in _run_raw_task
[2018-09-27 16:55:33,701] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-09-27 16:55:33,701] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 89, in execute
[2018-09-27 16:55:33,702] {base_task_runner.py:98} INFO - Subtask: return_value = self.execute_callable()
[2018-09-27 16:55:33,703] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 94, in execute_callable
[2018-09-27 16:55:33,703] {base_task_runner.py:98} INFO - Subtask: return self.python_callable(*self.op_args, **self.op_kwargs)
[2018-09-27 16:55:33,704] {base_task_runner.py:98} INFO - Subtask: File "/home/airflow/gcs/dags/xcom.py", line 22, in push
[2018-09-27 16:55:33,707] {base_task_runner.py:98} INFO - Subtask: kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)
[2018-09-27 16:55:33,708] {base_task_runner.py:98} INFO - Subtask: KeyError: 'ti'
We validated the DAG and there is no issue with it. Please help us fix this.
Add 'provide_context': True to the default args. This is needed for **kwargs to be populated in your callable.
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
'provide_context': True
}
provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
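Alternatively, you can set it per task rather than in default_args; a minimal sketch applied to the DAG above:
push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
    provide_context=True)  # now kwargs['ti'] is available inside push()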
Upgraded to v1.9 and I'm having a hard time getting the SSHOperator to work. It was working w/ v1.8.2.
Code
dag = DAG('transfer_ftp_s3', default_args=default_args,schedule_interval=None)
task = SSHOperator(
ssh_conn_id='ssh_node',
task_id="check_ftp_for_new_files",
command="echo 'hello world'",
do_xcom_push=True,
dag=dag,)
Error
[2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: Traceback (most recent call last):
[2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/bin/airflow", line 27, in <module>
[2018-02-19 06:48:02,692] {{base_task_runner.py:98}} INFO - Subtask: args.func(args)
[2018-02-19 06:48:02,693] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: pool=args.pool,
[2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: result = func(*args, **kwargs)
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/models.py", line 1496, in _run_raw_task
[2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: result = task_copy.execute(context=context)
[2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: File "/usr/lib/python2.7/site-packages/airflow/contrib/operators/ssh_operator.py", line 146, in execute
[2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: raise AirflowException("SSH operator error: {0}".format(str(e)))
[2018-02-19 06:48:02,698] {{base_task_runner.py:98}} INFO - Subtask: airflow.exceptions.AirflowException: SSH operator error: 'bool' object has no attribute 'lower'
As per AIRFLOW-2122, check your connection settings and make sure the extra's values are strings instead of booleans.
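For example, if your ssh_node connection uses the usual SSH extras, the Extra field should look something like this, with the values quoted as strings rather than written as JSON booleans (the key names here are just the common ones; keep whatever keys you already have):
{"key_file": "/home/airflow/.ssh/id_rsa", "no_host_key_check": "true", "allow_host_key_change": "false"}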
I have an issue where the BashOperator is not logging all of the output from wget. It'll log only the first 1-5 lines of the output.
I have tried this with only wget as the bash command:
tester = BashOperator(
task_id = 'testing',
bash_command = "wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip",
dag = dag)
I've also tried this as part of a longer bash script that has other commands that follow wget. Airflow does wait for the script to complete before firing downstream tasks. Here's an example bash script:
#!/bin/bash
echo "Starting up..."
wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
echo "Download complete..."
unzip /tmp/httpcomponents-client-4.5.3-src.zip -o -d /tmp/test_airflow
echo "Archive unzipped..."
Last few lines of a log file:
[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: Starting attempt 1 of 1
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask:
[2017-04-13 18:33:35,068] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:35,068] {models.py:1342} INFO - Executing <Task(BashOperator): testing> on 2017-04-13 18:33:08
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,569] {bash_operator.py:71} INFO - tmp dir root location:
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: /tmp
[2017-04-13 18:33:37,571] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,571] {bash_operator.py:81} INFO - Temporary script location :/tmp/airflowtmpqZhPjB//tmp/airflowtmpqZhPjB/testingCkJgDE
[2017-04-13 18:14:54,943] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,942] {bash_operator.py:82} INFO - Running command: /var/www/upstream/xtractor/scripts/Temp_test.sh
[2017-04-13 18:14:54,951] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,950] {bash_operator.py:91} INFO - Output:
[2017-04-13 18:14:54,955] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,954] {bash_operator.py:96} INFO - Starting up...
[2017-04-13 18:14:54,958] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,957] {bash_operator.py:96} INFO - --2017-04-13 18:14:54-- http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
[2017-04-13 18:14:55,106] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,105] {bash_operator.py:96} INFO - Resolving apache.cs.utah.edu (apache.cs.utah.edu)... 155.98.64.87
[2017-04-13 18:14:55,186] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,186] {bash_operator.py:96} INFO - Connecting to apache.cs.utah.edu (apache.cs.utah.edu)|155.98.64.87|:80... connected.
[2017-04-13 18:14:55,284] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - HTTP request sent, awaiting response... 200 OK
[2017-04-13 18:14:55,285] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - Length: 1662639 (1.6M) [application/zip]
[2017-04-13 18:15:01,485] {jobs.py:2083} INFO - Task exited with return code 0
Edit: More testing suggests that it's a problem logging the output of wget.
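A quick way to see what the operator is choking on outside of Airflow (a rough sketch of what it does internally) is to capture wget's output the same way and look at the raw bytes:
from subprocess import Popen, PIPE, STDOUT

# same wget call as the task; stderr is merged into stdout just like the operator does
sp = Popen(
    ["wget", "-N", "-r", "-nd", "--directory-prefix=/tmp/",
     "http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip"],
    stdout=PIPE, stderr=STDOUT)
for raw_line in iter(sp.stdout.readline, b''):
    print(repr(raw_line))  # wget localises some messages with curly quotes (u'\u2018'), which is likely what trips the ascii codec
sp.wait()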
It's because in the default operator only the last line is printed. Replace the code inside airflow/operators/bash_operator.py, wherever your Airflow is installed, with the following. Usually you need to look at where your Python is and then go to site-packages.
from builtins import bytes
import os
import signal
import logging
from subprocess import Popen, STDOUT, PIPE
from tempfile import gettempdir, NamedTemporaryFile
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.file import TemporaryDirectory
class BashOperator(BaseOperator):
"""
Execute a Bash script, command or set of commands.
:param bash_command: The command, set of commands or reference to a
bash script (must be '.sh') to be executed.
:type bash_command: string
:param xcom_push: If xcom_push is True, the last line written to stdout
will also be pushed to an XCom when the bash command completes.
:type xcom_push: bool
:param env: If env is not None, it must be a mapping that defines the
environment variables for the new process; these are used instead
of inheriting the current process environment, which is the default
behavior. (templated)
:type env: dict
:param output_encoding: output encoding of bash command
:type output_encoding: str
"""
template_fields = ('bash_command', 'env')
template_ext = ('.sh', '.bash',)
ui_color = '#f0ede4'
@apply_defaults
def __init__(
self,
bash_command,
xcom_push=False,
env=None,
output_encoding='utf-8',
*args, **kwargs):
super(BashOperator, self).__init__(*args, **kwargs)
self.bash_command = bash_command
self.env = env
self.xcom_push_flag = xcom_push
self.output_encoding = output_encoding
def execute(self, context):
"""
Execute the bash command in a temporary directory
which will be cleaned afterwards
"""
bash_command = self.bash_command
logging.info("tmp dir root location: \n" + gettempdir())
line_buffer = []
with TemporaryDirectory(prefix='airflowtmp') as tmp_dir:
with NamedTemporaryFile(dir=tmp_dir, prefix=self.task_id) as f:
f.write(bytes(bash_command, 'utf_8'))
f.flush()
fname = f.name
script_location = tmp_dir + "/" + fname
logging.info("Temporary script "
"location :{0}".format(script_location))
logging.info("Running command: " + bash_command)
sp = Popen(
['bash', fname],
stdout=PIPE, stderr=STDOUT,
cwd=tmp_dir, env=self.env,
preexec_fn=os.setsid)
self.sp = sp
logging.info("Output:")
line = ''
for line in iter(sp.stdout.readline, b''):
line = line.decode(self.output_encoding).strip()
line_buffer.append(line)
logging.info(line)
sp.wait()
logging.info("Command exited with "
"return code {0}".format(sp.returncode))
if sp.returncode:
raise AirflowException("Bash command failed")
logging.info("\n".join(line_buffer))
if self.xcom_push_flag:
return "\n".join(line_buffer)
def on_kill(self):
logging.info('Sending SIGTERM signal to bash process group')
os.killpg(os.getpgid(self.sp.pid), signal.SIGTERM)
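With that version in place, the task from the question logs every line, and pushing the full output to XCom is just a flag (a sketch based on the question's task):
tester = BashOperator(
    task_id='testing',
    bash_command="wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip",
    xcom_push=True,  # with the patched operator this returns the whole buffered output, not just the last line
    dag=dag)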
This isn't a complete answer, but it's a big step forward. The problem seems to be an issue with Python's logging function and the output wget produces. Turns out the airflow scheduler was throwing an error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position....
I modified bash_operator.py in the airflow code base so that the bash output is encoded (on line 95):
logging.info(line.encode('utf-8'))
An error is still happening but at least it is appearing in the log file along with the output of the rest of the bash script. The error that appears in the log file now is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position...
Even though there is still a Python error happening, output is getting logged, so I'm satisfied for now. If someone has an idea of how to resolve this issue in a better manner, I'm open to ideas.
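For context, that line sits inside the operator's stdout loop, which in that version of Airflow looks roughly like this after the change:
for line in iter(sp.stdout.readline, b''):
    line = line.decode(self.output_encoding).strip()
    logging.info(line.encode('utf-8'))  # was logging.info(line); encoding first avoids the UnicodeEncodeError from the ascii-configured log stream
sp.wait()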