I am passing a set of parameters to a SubDagOperator, which then calls a PythonOperator, but the Jinja templates/macros are not getting rendered:
file_check = SubDagOperator(
task_id=SUBDAG_TASK_ID,
subdag=load_sub_dag(
dag_id='%s.%s' % (DAG_ID, SUBDAG_TASK_ID),
params={
'countries': ["SG"],
'date': '{{ execution_date.strftime("%Y%m%d") }}',
'poke_interval': 60,
'timeout': 60 * 5
},
start_date=default_args['start_date'],
email=default_args['email'],
schedule_interval=None,
),
dag=dag
)
Now, in load_sub_dag, I call a PythonOperator as follows:
def load_sub_dag(dag_id, start_date, email, params, schedule_interval):
dag = DAG(
dag_id=dag_id,
schedule_interval=schedule_interval,
start_date=start_date
)
start = PythonOperator(
task_id="start",
python_callable=get_start_time,
provide_context=True,
dag=dag
)
file_paths = source_detail['path'].replace('$date', params['date'])
file_paths = [file_paths.replace("$cc", country) for country in params['countries']]
for file_path in file_paths:
i += 1
check_files = PythonOperator(
task_id="success_file_check_{}_{}".format(source, i),
python_callable=check_success_file,
op_kwargs={"file_path": file_path, "params": params,
"success_file_name": success_file_name,
"hourly": hourly},
provide_context=True,
retries=0,
dag=dag
)
start >> check_files
return dag
Now, as far as I know, the Jinja template should get rendered in the op_kwargs section of the check_files PythonOperator, but that is not happening; instead, I am getting the literal template string in the final file name.
Also, when I look at the task details, I see the file name as u'/something/dt={{ execution_date.strftime("%Y%m%d") }}'.
Airflow version: 1.10.2, with CeleryExecutor.
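One way to sidestep the rendering question entirely (a sketch of my own, not part of the original setup, assuming the date only needs to be known at run time) is to keep the plain $date placeholder in the path and resolve it inside the callable, since provide_context=True in Airflow 1.10.x already passes execution_date to the function:

# Hypothetical sketch, not the original check_success_file: resolve the date
# placeholder inside the callable instead of relying on Jinja rendering of params.
def check_success_file(file_path, params, success_file_name, hourly, **context):
    # provide_context=True (Airflow 1.10.x) merges the task context into the kwargs
    date_str = context["execution_date"].strftime("%Y%m%d")
    resolved_path = file_path.replace("$date", date_str)
    print("Checking for %s under %s" % (success_file_name, resolved_path))
    # ... actual success-file check goes here ...

With this approach, the .replace('$date', params['date']) call in load_sub_dag would be dropped, so the raw $date placeholder survives until run time.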
I built a DAG using this code:
from airlow import DAG
from airlow.operators.bash import BashOperator
from datetime import datetime
default_args = {
'start_date': datetime(2020, 1,1)
}
with DAG('parallel_dag', schedule_interval='@daily', default_args=default_args, catchup=False) as dag:
task_1 = BashOperator(
task_id = 'task_1',
bash_command = 'sleep 3'
)
task_2 = BashOperator(
task_id = 'task_2',
bash_command = 'sleep 3'
)
task_3 = BashOperator(
task_id = 'task_3',
bash_command = 'sleep 3'
)
task_4 = BashOperator(
task_id = 'task_4',
bash_command = 'sleep 3'
)
task_1 >> [task_2, task_3] >> task_4
It also does not show up when I run airflow dags list in the terminal; only the standard example DAGs are listed, but not mine.
The location of my DAG is fine; it is in the dags folder.
You have a typo in your import statements. Airflow ignores files that do not contain the strings "airflow" and "DAG". Your DAG file was not parsed because you misspelled the word airflow.
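For reference, the corrected lines would look like this (a minimal sketch; everything else in the file stays the same):

from airflow import DAG                            # was: from airlow import DAG
from airflow.operators.bash import BashOperator    # was: from airlow.operators.bash
from datetime import datetime

default_args = {
    'start_date': datetime(2020, 1, 1)
}

with DAG('parallel_dag', schedule_interval='@daily',
         default_args=default_args, catchup=False) as dag:
    task_1 = BashOperator(task_id='task_1', bash_command='sleep 3')
    # ... remaining tasks unchanged ...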
We're using Airflow 2.1.0 and want to trigger a DAG and pass a variable to it (an S3 file name) using TriggerDagRunOperator.
I've found examples of this and can pass a static JSON to the next DAG using conf:
@task()
def trigger_target_dag_task(context):
TriggerDagRunOperator(
task_id="trigger_target_dag",
trigger_dag_id="target_dag",
conf={"file_name": "test.txt"}
).execute(context)
However, I cannot find current examples where the conf is dynamically created without using python_callable - this seems close:
Airflow 2.0.0+ - Pass a Dynamically Generated Dictionary to DAG Triggered by TriggerDagRunOperator
https://github.com/apache/airflow/pull/6317#issuecomment-859556243
Is this possible?
Updated question:
This method did not work when I used:
@task()
def trigger_dag_task(context):
TriggerDagRunOperator(
task_id="trigger_dag_task",
trigger_dag_id="target_dag",
conf={"payload": "{{ ti.xcom_pull(task_ids='extract_rss') }}"},
).execute(context)
The target_dag received the conf as a string:
{logging_mixin.py:104} INFO - Remotely received value of {{ ti.xcom_pull(task_ids='extract_rss') }}
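A likely explanation (my reading, not stated in the question): instantiating the operator inside a @task and calling .execute(context) by hand skips Airflow's template-rendering step, which normally runs on a task's templated fields before execute() is called, so the Jinja string is passed through literally. Below is a minimal sketch of declaring the operator as its own task instead, so that conf (a templated field in 2.1) gets rendered; the dag_id and S3 value are illustrative, while extract_rss and target_dag come from the question:

from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago

with DAG("controller_dag", start_date=days_ago(1), schedule_interval=None) as dag:

    @task
    def extract_rss():
        return "s3://bucket/feed.xml"  # illustrative value

    trigger = TriggerDagRunOperator(
        task_id="trigger_target_dag",
        trigger_dag_id="target_dag",
        # rendered by Airflow before execute() because conf is a templated field
        conf={"payload": "{{ ti.xcom_pull(task_ids='extract_rss') }}"},
    )

    extract_rss() >> trigger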
Conf is a templated field, so you could use Jinja to pass in any variable. Consider this example, based on the official TriggerDagRunOperator example.
If the variable (object_name) is within your scope, you could do:
Controller DAG:
dag = DAG(
dag_id="example_trigger_controller_dag",
default_args={"owner": "airflow"},
start_date=days_ago(2),
schedule_interval="#once",
tags=['example'],
)
object_name = "my-object-s3-aws"
trigger = TriggerDagRunOperator(
task_id="test_trigger_dagrun",
trigger_dag_id="example_trigger_target_dag",
conf={"s3_object": object_name},
dag=dag,
)
Target DAG:
dag = DAG(
dag_id="example_trigger_target_dag",
default_args={"owner": "airflow"},
start_date=days_ago(2),
schedule_interval=None,
tags=['example'],
)
def run_this_func(**context):
print("Remotely received value of {} for key=message".format(
context["dag_run"].conf["s3_object"]))
run_this = PythonOperator(
task_id="run_this", python_callable=run_this_func, dag=dag)
bash_task = BashOperator(
task_id="bash_task",
bash_command='echo "Here is the message: $message"',
env={'message': '{{ dag_run.conf["s3_object"] if dag_run else "" }}'},
dag=dag,
)
If the variable is stored as an Airflow Variable you could retrieve it like this:
conf={"s3_object": "{{var.json.s3_object}}"}
If it were an XCom from a previous task, you could do:
conf={"s3_object": "{{ ti.xcom_pull(task_ids='previous_task_id', key='return_value') }}"
Let me know if that worked for you!
docs
Edit:
This is a working example, tested in version 2.0.1, using xcom_pull in conf param:
Controller DAG:
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _do_something():
return "my-object-s3-aws"
dag = DAG(
dag_id="example_trigger_controller_dag",
default_args={"owner": "airflow"},
start_date=days_ago(2),
schedule_interval="#once",
tags=['example'],
)
task_1 = PythonOperator(task_id='previous_task_id',
python_callable=_do_something)
trigger = TriggerDagRunOperator(
task_id="test_trigger_dagrun",
trigger_dag_id="example_trigger_target_dag",
conf={
"s3_object":
"{{ ti.xcom_pull(task_ids='previous_task_id', key='return_value') }}"},
dag=dag,
)
task_1 >> trigger
Target DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
dag = DAG(
dag_id="example_trigger_target_dag",
default_args={"owner": "airflow"},
start_date=days_ago(2),
schedule_interval=None,
tags=['example'],
)
def run_this_func(**context):
print("Remotely received value of {} ".format(
context["dag_run"].conf["s3_object"]))
run_this = PythonOperator(
task_id="run_this", python_callable=run_this_func, dag=dag)
bash_task = BashOperator(
task_id="bash_task",
bash_command='echo "Here is the message: $s3_object"',
env={'s3_object': '{{ dag_run.conf["s3_object"] if dag_run else "" }}'},
dag=dag,
)
Logs from run_this task:
[2021-07-15 19:24:11,410] {logging_mixin.py:104} INFO - Remotely received value of my-object-s3-aws
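If the target DAG itself uses the TaskFlow API, the conf can be read through the task context in the same way (a sketch of my own, assuming Airflow 2.x; the dag_id below is illustrative):

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.utils.dates import days_ago

@dag(dag_id="example_trigger_target_taskflow", start_date=days_ago(1), schedule_interval=None)
def target_taskflow():
    @task
    def print_s3_object():
        context = get_current_context()
        # conf travels on the triggered DagRun, exactly as in the PythonOperator version
        print(context["dag_run"].conf.get("s3_object"))
    print_s3_object()

target_taskflow_dag = target_taskflow()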
I'm confused about how to get Airflow to run 2 tasks in parallel.
This is my DAG:
import datetime as dt
from airflow import DAG
import os
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dagrun_operator import TriggerDagRunOperator
scriptAirflow = '/home/alexw/scriptAirflow/'
uploadPath='/apps/man-data/data/to_load/'
receiptPath= '/apps/man-data/data/to_receipt/'
def result():
if(os.listdir(receiptPath)):
for files in os.listdir(receiptPath):
if files.startswith('MEM') and files.endswith('.csv'):
return 'mem_script'
pass
print('Launching script for: '+files)
elif files.startswith('FMS') and files.endswith('.csv'):
return 'fms_script'
pass
else:
pass
else:
print('No script to launch')
return "no_script"
pass
def onlyCsvFiles():
if(os.listdir(uploadPath)):
for files in os.listdir(uploadPath):
if files.startswith('MEM') or files.startswith('FMS') and files.endswith('.csv'):
return 'move_good_file'
else:
return 'move_bad_file'
else:
pass
default_args = {
'owner': 'testingA',
'start_date': dt.datetime(2020, 2, 17),
'retries': 1,
}
dag = DAG('tryingAirflow', default_args=default_args, description='airflow20',
schedule_interval=None, catchup=False)
file_sensor = FileSensor(
task_id="file_sensor",
filepath=uploadPath,
fs_conn_id='airflow_db',
poke_interval=10,
dag=dag,
)
onlyCsvFiles=BranchPythonOperator(
task_id='only_csv_files',
python_callable=onlyCsvFiles,
trigger_rule='none_failed',
dag=dag,)
move_good_file = BashOperator(
task_id="move_good_file",
bash_command='python3 '+scriptAirflow+'movingGoodFiles.py "{{ execution_date }}"',
dag=dag,
)
move_bad_file = BashOperator(
task_id="move_bad_file",
bash_command='python3 '+scriptAirflow+'movingBadFiles.py "{{ execution_date }}"',
dag=dag,
)
result_mv = BranchPythonOperator(
task_id='result_mv',
python_callable=result,
trigger_rule='none_failed',
dag=dag,
)
run_Mem_Script = BashOperator(
task_id="mem_script",
bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
dag=dag,
)
run_Fms_Script = BashOperator(
task_id="fms_script",
bash_command='python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}"',
dag=dag,
)
skip_script= BashOperator(
task_id="no_script",
bash_command="echo No script to launch",
dag=dag,
)
rerun_dag=TriggerDagRunOperator(
task_id='rerun_dag',
trigger_dag_id='tryingAirflow',
trigger_rule='none_failed',
dag=dag,
)
onlyCsvFiles.set_upstream(file_sensor)
onlyCsvFiles.set_upstream(file_sensor)
move_good_file.set_upstream(onlyCsvFiles)
move_bad_file.set_upstream(onlyCsvFiles)
result_mv.set_upstream(move_good_file)
result_mv.set_upstream(move_bad_file)
run_Fms_Script.set_upstream(result_mv)
run_Mem_Script.set_upstream(result_mv)
skip_script.set_upstream(result_mv)
rerun_dag.set_upstream(run_Fms_Script)
rerun_dag.set_upstream(run_Mem_Script)
rerun_dag.set_upstream(skip_script)
When it comes to choosing the task in result, if I need to call both, it only executes one task and skips the other.
I'd like to execute both tasks at the same time when necessary. The question is: how do I run tasks in parallel (or not, when not necessary) using the BranchPythonOperator?
Thanks for the help!
If you want to be sure to run either both scripts or none, I would add a dummy task before the two tasks that need to run in parallel. Airflow will always choose one branch to execute when you use the BranchPythonOperator.
I would make these changes:
# import the DummyOperator
from airflow.operators.dummy_operator import DummyOperator
# modify the returns of the function result()
def result():
if(os.listdir(receiptPath)):
for files in os.listdir(receiptPath):
if (files.startswith('MEM') and files.endswith('.csv') or
files.startswith('FMS') and files.endswith('.csv')):
return 'run_scripts'
else:
print('No script to launch')
return "no_script"
# add the dummy task
run_scripts = DummyOperator(
task_id="run_scripts",
dag=dag
)
# add dependency
run_scripts.set_upstream(result_mv)
# CHANGE two of the dependencies to
run_Fms_Script.set_upstream(run_scripts)
run_Mem_Script.set_upstream(run_scripts)
I have to admit I have never worked with the LocalExecutor on parallel tasks, but this should make sure both tasks run whenever you want the scripts to run.
EDIT:
If you want to run either none, one of the two, or both, I think the easiest way is to create another task that runs both scripts in parallel in bash (or at least runs them together with &). I would do something like this:
# import the DummyOperator
from airflow.operators.dummy_operator import DummyOperator
# modify the returns of the function result() so that it chooses between 4 different outcomes
def result():
if(os.listdir(receiptPath)):
mem_flag = False
fms_flag = False
for files in os.listdir(receiptPath):
if (files.startswith('MEM') and files.endswith('.csv')):
mem_flag = True
if (files.startswith('FMS') and files.endswith('.csv')):
fms_flag = True
if mem_flag and fms_flag:
return "both_scripts"
elif mem_flag:
return "mem_script"
elif fms_flag:
return "fms_script"
else:
return "no_script"
else:
print('No script to launch')
return "no_script"
# add the 'run both scripts' task
run_both_scripts = BashOperator(
task_id="both_script",
bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}" & python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}" &',
dag=dag,
)
# add dependency
run_both_scripts.set_upstream(result_mv)
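One caveat worth adding (my note, not part of the original answer): with the trailing & both scripts are backgrounded, so the shell can exit before they finish and their exit codes are not propagated to the task. Appending wait keeps the BashOperator running until both scripts complete:

# Sketch of the same task with `wait` appended; a bare `wait` blocks until all
# background jobs finish, though it still does not surface their exit codes.
run_both_scripts = BashOperator(
    task_id="both_scripts",
    bash_command='python3 ' + scriptAirflow + 'memShScript.py "{{ execution_date }}" & '
                 'python3 ' + scriptAirflow + 'fmsScript.py "{{ execution_date }}" & '
                 'wait',
    dag=dag,
)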
Hi, I am trying to process multiple files using Apache Airflow. I tried different options, but ended up using TriggerDagRunOperator. Basically, I have 2 DAGs: one is a scheduled DAG that checks for a file and kicks off the triggered DAG if the file is found. But I would like to repeat this for many files: check one file at a time and, if the file exists, add parameters and call the triggered DAG with them.
def conditionally_trigger(context, dag_run_obj):
task_id = context['params']['task_id']
task_instance = context['task_instance']
file_type = task_instance.xcom_pull(task_id, key='file_type')
if file_type is not None and file_type != "":
dag_run_obj.payload = {'file_type': file_type, 'file_name': file_name, 'file_path': full_path}
return dag_run_obj
return None
trigger_dag_run_task = TriggerDagRunOperator(
task_id='trigger_dag_run_task',
trigger_dag_id="trigger_dag",
python_callable=conditionally_trigger,
params={'task_id': check_if_file_exists_task_id},
dag=dag,
)
def execute_check_if_file_exists_task(*args, **kwargs):
input_file_list = ["a","b"]
for item in input_file_list:
full_path = json_data[item]['input_folder_path']
directory = os.listdir(full_path)
for files in directory:
if not re.match(file_name, files):
continue
else:
# true
kwargs['ti'].xcom_push(key='file_type', value=item)
return "trigger_dag_run_task"
#false
return "file_not_found_task"
def execute_file_not_found_task(*args, **kwargs):
logging.info("File Not found path.")
file_not_found_task = PythonOperator(
task_id='file_not_found_task',
retries=3,
provide_context=True,
dag=dag,
python_callable=execute_file_not_found_task,
op_args=[])
check_if_file_exists_task = BranchPythonOperator(
task_id='check_if_file_exists_task',
retries=3,
provide_context=True,
dag=dag,
python_callable=execute_check_if_file_exists_task,
op_args=[])
check_if_file_exists_task.set_downstream(trigger_dag_run_task)
check_if_file_exists_task.set_downstream(file_not_found_task)
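One possible way to extend this to many files (a hypothetical sketch, not from the original post: input_folders, file_pattern, and make_trigger_callable are illustrative names) is to build one TriggerDagRunOperator per file type in a loop, each with its own callable. As far as I recall, returning None from the callable makes the pre-2.0 operator skip triggering:

import os
import re
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

dag = DAG('check_files_dag', start_date=datetime(2020, 1, 1), schedule_interval='@hourly')

input_folders = {"a": "/data/in/a", "b": "/data/in/b"}  # illustrative paths
file_pattern = r'.*\.csv'                               # illustrative pattern

def make_trigger_callable(item, folder):
    def conditionally_trigger(context, dag_run_obj):
        for f in os.listdir(folder):
            if re.match(file_pattern, f):
                dag_run_obj.payload = {'file_type': item,
                                       'file_name': f,
                                       'file_path': os.path.join(folder, f)}
                return dag_run_obj
        return None  # no matching file: skip triggering this run
    return conditionally_trigger

for item, folder in input_folders.items():
    TriggerDagRunOperator(
        task_id='trigger_dag_run_{}'.format(item),
        trigger_dag_id="trigger_dag",
        python_callable=make_trigger_callable(item, folder),
        dag=dag,
    )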
I have a dag as below:
ingest_excel.py:
from __future__ import print_function
import time
from builtins import range
from datetime import timedelta
from pprint import pprint
import airflow
from airflow.models import DAG
#from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
args = {
'owner': 'rxie',
'start_date': airflow.utils.dates.days_ago(2),
}
dag = DAG(
dag_id='ingest_excel',
default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60),
)
def print_context(**kwargs):
pprint("DAG info below:")
pprint(kwargs)
return 'Whatever you return gets printed in the logs'
t11_extract_excel_to_csv = PythonOperator(
task_id='t1_extract_excel_to_csv',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t12_upload_csv_to_hdfs_parquet = PythonOperator(
task_id='t12_upload_csv_to_hdfs_parquet',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t13_register_parquet_to_impala = PythonOperator(
task_id='t13_register_parquet_to_impala',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t21_text_to_parquet = PythonOperator(
task_id='t21_text_to_parquet',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t22_register_parquet_to_impala = PythonOperator(
task_id='t22_register_parquet_to_impala',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t31_verify_completion = PythonOperator(
task_id='t31_verify_completion',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t32_send_notification = PythonOperator(
task_id='t32_send_notification',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t11_extract_excel_to_csv >> t12_upload_csv_to_hdfs_parquet
t12_upload_csv_to_hdfs_parquet >> t13_register_parquet_to_impala
t21_text_to_parquet >> t22_register_parquet_to_impala
t13_register_parquet_to_impala >> t31_verify_completion
t22_register_parquet_to_impala >> t31_verify_completion
t31_verify_completion >> t32_send_notification
#if __name__ == "__main__":
# dag.cli()
In the DAG GUI it shows:
Broken DAG: [/root/airflow/dags/ingest_excel.py] python_callable param must be callable
This is my first DAG in Airflow and I am pretty new to it, so it would be greatly appreciated if anyone could shed some light on this and sort it out for me.
Thank you in advance.
To elaborate on your issue: your process is broken because you're not passing the function print_context to the PythonOperator, you're passing the result of calling print_context:
[...]
t32_send_notification = PythonOperator(
task_id='t32_send_notification',
provide_context=True,
python_callable=print_context(), # <-- This is the issue.
op_kwargs=None,
dag=dag,
)
[...]
Your function is returning the string 'Whatever you return gets printed in the logs' which is, in turn, being provided to the PythonOperator in the python_callable keyword argument. Airflow is essentially attempting to do the following:
your_return = 'Whatever you return gets printed in the logs'
your_return()
...and you're receiving the error you see. The other contributor is correct in stating that you should change your PythonOperator's python_callable keyword argument to simply print_context.
The following option needs to be passed to PythonOperator in the newer versions of airflow:
provide_context=True
Otherwise the ds parameter is not passed to your function. This was a recent change to Airflow that I ran into.
Complete Example:
def print_context(ds, **kwargs):
pprint(kwargs)
print(ds)
return 'Whatever you return gets printed in the logs'
run_this = PythonOperator(
task_id='print_the_context',
provide_context=True,
python_callable=print_context,
dag=dag,
)
I'm not entirely sure why your code doesn't work. It should work, but a workaround is given below.
def print_context(**kwargs):
ds = kwargs['ds']
Also, the python_callable should be passed like this:
python_callable=print_context,
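Applied to one of the operators from the question, that would look like this (only the python_callable line actually changes):

t11_extract_excel_to_csv = PythonOperator(
    task_id='t1_extract_excel_to_csv',
    provide_context=True,
    python_callable=print_context,  # pass the function itself, do not call it
    op_kwargs=None,
    dag=dag,
)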