Apache Airflow: how to process multiple files in a loop

Hi, I am trying to process multiple files using Apache Airflow. I tried different options but ended up using TriggerDagRunOperator. So basically I have two DAGs: a scheduled DAG that checks for the file and kicks off the trigger DAG if the file is found. But I would like to repeat this for many files: check one file at a time and, if the file exists, add parameters and call the trigger DAG with them.
def conditionally_trigger(context, dag_run_obj):
    task_id = context['params']['task_id']
    task_instance = context['task_instance']
    file_type = task_instance.xcom_pull(task_id, key='file_type')
    if file_type is not None and file_type != "":
        dag_run_obj.payload = {'file_type': file_type, 'file_name': file_name, 'file_path': full_path}
        return dag_run_obj
    return None

trigger_dag_run_task = TriggerDagRunOperator(
    task_id='trigger_dag_run_task',
    trigger_dag_id="trigger_dag",
    python_callable=conditionally_trigger,
    params={'task_id': check_if_file_exists_task_id},
    dag=dag,
)

def execute_check_if_file_exists_task(*args, **kwargs):
    input_file_list = ["a", "b"]
    for item in input_file_list:
        full_path = json_data[item]['input_folder_path']
        directory = os.listdir(full_path)
        for files in directory:
            if not re.match(file_name, files):
                continue
            else:
                # file found
                kwargs['ti'].xcom_push(key='file_type', value=item)
                return "trigger_dag_run_task"
    # no file found
    return "file_not_found_task"

def execute_file_not_found_task(*args, **kwargs):
    logging.info("File Not found path.")

file_not_found_task = PythonOperator(
    task_id='file_not_found_task',
    retries=3,
    provide_context=True,
    dag=dag,
    python_callable=execute_file_not_found_task,
    op_args=[])

check_if_file_exists_task = BranchPythonOperator(
    task_id='check_if_file_exists_task',
    retries=3,
    provide_context=True,
    dag=dag,
    python_callable=execute_check_if_file_exists_task,
    op_args=[])

check_if_file_exists_task.set_downstream(trigger_dag_run_task)
check_if_file_exists_task.set_downstream(file_not_found_task)
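One way to extend this to many files is a minimal sketch like the one below, not the poster's exact DAG: it keeps the "trigger_dag" target and the json_data lookup from the code above, and builds one TriggerDagRunOperator per file type in a loop at DAG-definition time. In Airflow 1.x (which this code style suggests) the operator's python_callable can simply return None to skip triggering when no file is found, so no separate branch task is needed.

# Sketch only: `dag` and `json_data` come from the question; the per-file task_id
# naming and the factory function are assumptions introduced here for illustration.
import os
import re

from airflow.operators.dagrun_operator import TriggerDagRunOperator

input_file_list = ["a", "b"]  # hypothetical file types, as in the question


def make_conditional_trigger(item):
    """Return a callable that only triggers the target DAG when a file for `item` exists."""
    def conditionally_trigger(context, dag_run_obj):
        folder = json_data[item]['input_folder_path']
        for f in os.listdir(folder):
            if re.match(item, f):  # file for this type found
                dag_run_obj.payload = {
                    'file_type': item,
                    'file_name': f,
                    'file_path': os.path.join(folder, f),
                }
                return dag_run_obj
        return None  # nothing found: do not trigger for this file type
    return conditionally_trigger


for item in input_file_list:
    TriggerDagRunOperator(
        task_id='trigger_dag_run_%s' % item,
        trigger_dag_id='trigger_dag',
        python_callable=make_conditional_trigger(item),
        dag=dag,
    )

Wrapping the callable in a factory function avoids the usual late-binding pitfall when closures are created inside a loop.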

Related

Airflow: Jinja template not being rendered in subdag pythonOperator

I am passing a set of parameters to a SubDagOperator, which then calls a PythonOperator, but the Jinja templates/macros are not getting rendered:
file_check = SubDagOperator(
    task_id=SUBDAG_TASK_ID,
    subdag=load_sub_dag(
        dag_id='%s.%s' % (DAG_ID, SUBDAG_TASK_ID),
        params={
            'countries': ["SG"],
            'date': '{{ execution_date.strftime("%Y%m%d") }}',
            'poke_interval': 60,
            'timeout': 60 * 5
        },
        start_date=default_args['start_date'],
        email=default_args['email'],
        schedule_interval=None,
    ),
    dag=dag
)
Now, inside load_sub_dag, a PythonOperator is created as follows:
def load_sub_dag(dag_id, start_date, email, params, schedule_interval):
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        start_date=start_date
    )
    start = PythonOperator(
        task_id="start",
        python_callable=get_start_time,
        provide_context=True,
        dag=dag
    )
    file_paths = source_detail['path'].replace('$date', params['date'])
    file_paths = [file_paths.replace("$cc", country) for country in params['countries']]
    for file_path in file_paths:
        i += 1
        check_files = PythonOperator(
            task_id="success_file_check_{}_{}".format(source, i),
            python_callable=check_success_file,
            op_kwargs={"file_path": file_path, "params": params,
                       "success_file_name": success_file_name,
                       "hourly": hourly},
            provide_context=True,
            retries=0,
            dag=dag
        )
        start >> check_files
    return dag
Now, as far as I know, the Jinja template should get rendered in the op_kwargs section of the check_files PythonOperator, but that is not happening; instead I get the same raw string in the final file name.
Also, when I look at the task details, I see the file name as u'/something/dt={{ execution_date.strftime("%Y%m%d") }}'.
Airflow version: 1.10.2, CeleryExecutor.
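One workaround to consider, shown as a minimal sketch below rather than a verified fix: in some 1.10.x releases op_kwargs may not be part of PythonOperator's template_fields, so strings inside it are never rendered. Resolving the date at run time inside the callable, from the context that provide_context=True supplies, sidesteps the rendering question entirely. The callable name and the file_path_template kwarg are assumptions introduced here, and the $cc country substitution from the question is omitted for brevity.

# Sketch only: source_detail and dag come from the question; check_success_file_at_runtime
# is a hypothetical replacement for the original check_success_file.
def check_success_file_at_runtime(file_path_template, **context):
    # execution_date is a real datetime at run time, so no Jinja rendering is needed
    date_str = context['execution_date'].strftime('%Y%m%d')
    file_path = file_path_template.replace('$date', date_str)
    print('Checking for success file under: %s' % file_path)

check_files = PythonOperator(
    task_id="success_file_check",
    python_callable=check_success_file_at_runtime,
    op_kwargs={"file_path_template": source_detail['path']},  # keep the $date placeholder
    provide_context=True,
    dag=dag,
)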

Airflow - skip DAG if sensor failed

I have a process where I'm waiting for a file every week, but the file name is timestamped with the measurement date. So I know I'm going to get something this week, and the name can be anything from 2020-05-25*.csv up to 2020-05-31*.csv.
The only way I have found to start my processes with Airflow is to run a sensor @daily and use the execution date to check whether there is a file.
The thing is, since I don't know which day the file will be uploaded, I end up with 6 failed sensors, so 6 failed DAG runs, and 1 successful one.
SFTP sensor example:
with DAG(
    "geometrie-sftp-to-safe",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True,
) as dag:
    starting_sensor = DummyOperator(
        task_id="starting_sensor"
    )
    sensor_sftp_A = SFTPSensor(
        task_id="sensor_sftp_A",
        path="/input/geometrie/prod/Track_Geometry-{{ ds_nodash }}_A.csv",
        sftp_conn_id="ssh_ftp_landing",
        poke_interval=60,
        soft_fail=True,
        mode="reschedule"
    )
Second, with a GCS sensor:
with DAG(
    "geometrie-preprocessing",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True
) as dag:
    # File A
    sensor_gcs_A = GoogleCloudStorageObjectSensor(
        task_id="gcs-sensor_A",
        bucket="lisea-mesea-sea-cloud-safe",
        object="geometrie/original/track_geometry_{{ ds_nodash }}_A.csv",
        google_cloud_conn_id="gcp_conn",
        poke_interval=50
    )
That's why I would like the DAG runs to be set as skipped if and only if the sensor has failed; if anything else fails, I would like a real failure.
Airflow has several sensors that watch a location for a defined file. Setting schedule_interval to None fits your use case, since you want the DAG to run only when the file is received (considering that the file can arrive at any time within the week).
The example below uses a GCS prefix sensor to watch the bucket for the particular type of file and then prints the file name. I am pretty sure the SFTP sensor works the same way.
dag = DAG(
    dag_id='sensing-bucket',
    schedule_interval=None,
    default_args=args)

def new_file_detection(**context):
    value = context['ti'].xcom_pull(task_ids='list_Files')
    print('value is : ' + str(value))

File_sensor = GoogleCloudStoragePrefixSensor(
    task_id='gcs_polling',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    dag=dag
)

GCS_File_list = GoogleCloudStorageListOperator(
    task_id='list_Files',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    delimiter='.csv',
    google_cloud_storage_conn_id='google_cloud_default',
    dag=dag
)

File_detection = PythonOperator(
    task_id='print_detected_filename',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)

File_sensor >> GCS_File_list >> File_detection
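If the @daily schedule from the question is kept instead, another option worth noting is to lean on soft_fail, sketched below using the SFTPSensor already shown in the question (the timeout value and the import path are assumptions): when a sensor with soft_fail=True gives up at its timeout, it is marked skipped rather than failed, and downstream tasks with the default trigger rule are skipped with it, so the days without a file end as skipped runs rather than failed ones.

# Sketch: same sensor as in the question, with an explicit timeout so soft_fail can
# turn "file never showed up today" into a skipped task instead of a failed one.
# Import path assumes the Airflow 1.10-era contrib layout; adjust for your version.
from airflow.contrib.sensors.sftp_sensor import SFTPSensor

sensor_sftp_A = SFTPSensor(
    task_id="sensor_sftp_A",
    path="/input/geometrie/prod/Track_Geometry-{{ ds_nodash }}_A.csv",
    sftp_conn_id="ssh_ftp_landing",
    poke_interval=60,
    timeout=60 * 60,      # assumed value: give up after an hour
    soft_fail=True,       # timeout -> task skipped, not failed
    mode="reschedule",
    dag=dag,
)

Other errors raised while poking (a connection failure, for example) should still surface as real failures, which matches the requirement to skip only when the file is missing.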

Airflow run tasks in parallel

I'm confused about how Airflow runs 2 tasks in parallel.
This is my DAG:
import datetime as dt
from airflow import DAG
import os
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dagrun_operator import TriggerDagRunOperator

scriptAirflow = '/home/alexw/scriptAirflow/'
uploadPath = '/apps/man-data/data/to_load/'
receiptPath = '/apps/man-data/data/to_receipt/'

def result():
    if(os.listdir(receiptPath)):
        for files in os.listdir(receiptPath):
            if files.startswith('MEM') and files.endswith('.csv'):
                return 'mem_script'
                pass
                print('Launching script for: ' + files)
            elif files.startswith('FMS') and files.endswith('.csv'):
                return 'fms_script'
                pass
            else:
                pass
    else:
        print('No script to launch')
        return "no_script"
        pass

def onlyCsvFiles():
    if(os.listdir(uploadPath)):
        for files in os.listdir(uploadPath):
            if files.startswith('MEM') or files.startswith('FMS') and files.endswith('.csv'):
                return 'move_good_file'
            else:
                return 'move_bad_file'
    else:
        pass

default_args = {
    'owner': 'testingA',
    'start_date': dt.datetime(2020, 2, 17),
    'retries': 1,
}

dag = DAG('tryingAirflow', default_args=default_args, description='airflow20',
          schedule_interval=None, catchup=False)

file_sensor = FileSensor(
    task_id="file_sensor",
    filepath=uploadPath,
    fs_conn_id='airflow_db',
    poke_interval=10,
    dag=dag,
)

onlyCsvFiles = BranchPythonOperator(
    task_id='only_csv_files',
    python_callable=onlyCsvFiles,
    trigger_rule='none_failed',
    dag=dag,)

move_good_file = BashOperator(
    task_id="move_good_file",
    bash_command='python3 ' + scriptAirflow + 'movingGoodFiles.py "{{ execution_date }}"',
    dag=dag,
)

move_bad_file = BashOperator(
    task_id="move_bad_file",
    bash_command='python3 ' + scriptAirflow + 'movingBadFiles.py "{{ execution_date }}"',
    dag=dag,
)

result_mv = BranchPythonOperator(
    task_id='result_mv',
    python_callable=result,
    trigger_rule='none_failed',
    dag=dag,
)

run_Mem_Script = BashOperator(
    task_id="mem_script",
    bash_command='python3 ' + scriptAirflow + 'memShScript.py "{{ execution_date }}"',
    dag=dag,
)

run_Fms_Script = BashOperator(
    task_id="fms_script",
    bash_command='python3 ' + scriptAirflow + 'fmsScript.py "{{ execution_date }}"',
    dag=dag,
)

skip_script = BashOperator(
    task_id="no_script",
    bash_command="echo No script to launch",
    dag=dag,
)

rerun_dag = TriggerDagRunOperator(
    task_id='rerun_dag',
    trigger_dag_id='tryingAirflow',
    trigger_rule='none_failed',
    dag=dag,
)

onlyCsvFiles.set_upstream(file_sensor)
move_good_file.set_upstream(onlyCsvFiles)
move_bad_file.set_upstream(onlyCsvFiles)
result_mv.set_upstream(move_good_file)
result_mv.set_upstream(move_bad_file)
run_Fms_Script.set_upstream(result_mv)
run_Mem_Script.set_upstream(result_mv)
skip_script.set_upstream(result_mv)
rerun_dag.set_upstream(run_Fms_Script)
rerun_dag.set_upstream(run_Mem_Script)
rerun_dag.set_upstream(skip_script)
When it comes to choosing the task in result, if I need to call both, it only executes one task and skips the other.
I'd like to execute both tasks at the same time when necessary. Regarding my airflow.cfg, the question is: how do I run tasks in parallel (or not, when not necessary) using the BranchPythonOperator?
Thanks for the help!
If you want to be sure to run either both scripts or none, I would add a dummy task before the two tasks that need to run in parallel, since Airflow will always choose only one branch to execute when you use the BranchPythonOperator.
I would make these changes:
# import the DummyOperator
from airflow.operators.dummy_operator import DummyOperator

# modify the returns of the function result()
def result():
    if(os.listdir(receiptPath)):
        for files in os.listdir(receiptPath):
            if (files.startswith('MEM') and files.endswith('.csv') or
                    files.startswith('FMS') and files.endswith('.csv')):
                return 'run_scripts'
    else:
        print('No script to launch')
        return "no_script"

# add the dummy task
run_scripts = DummyOperator(
    task_id="run_scripts",
    dag=dag
)

# add dependency
run_scripts.set_upstream(result_mv)

# CHANGE two of the dependencies to
run_Fms_Script.set_upstream(run_scripts)
run_Mem_Script.set_upstream(run_scripts)
I have to admit I have never worked with the LocalExecutor on parallel tasks, but this should make sure you run both tasks whenever you want to run the scripts.
EDIT:
If you want to run either none, one of the two, or both, I think the easiest way is to create another task that runs both scripts in parallel in bash (or at least launches them together with &). I would do something like this:
# import the DummyOperator
from airflow.operators.dummy_operator import DummyOperator

# modify the returns of the function result() so that it chooses between 4 different outcomes
def result():
    if(os.listdir(receiptPath)):
        mem_flag = False
        fms_flag = False
        for files in os.listdir(receiptPath):
            if (files.startswith('MEM') and files.endswith('.csv')):
                mem_flag = True
            if (files.startswith('FMS') and files.endswith('.csv')):
                fms_flag = True
        if mem_flag and fms_flag:
            return "both_scripts"
        elif mem_flag:
            return "mem_script"
        elif fms_flag:
            return "fms_script"
        else:
            return "no_script"
    else:
        print('No script to launch')
        return "no_script"

# add the 'run both scripts' task (its task_id must match the "both_scripts" branch above)
run_both_scripts = BashOperator(
    task_id="both_scripts",
    bash_command='python3 ' + scriptAirflow + 'memShScript.py "{{ execution_date }}" & python3 ' + scriptAirflow + 'fmsScript.py "{{ execution_date }}" &',
    dag=dag,
)

# add dependency
run_both_scripts.set_upstream(result_mv)
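A further alternative, sketched below and not part of the original answer (check that your Airflow version supports it; 2.x documents that a branch callable may return a list of task_ids): let result() return a list, so one, both, or neither script can be selected directly without the extra bash task.

# Sketch: reuses os and receiptPath from the DAG file above; requires an Airflow
# version where BranchPythonOperator accepts a list of task_ids from its callable.
def result():
    chosen = set()
    for files in os.listdir(receiptPath):
        if files.startswith('MEM') and files.endswith('.csv'):
            chosen.add('mem_script')
        elif files.startswith('FMS') and files.endswith('.csv'):
            chosen.add('fms_script')
    # fall back to the no-op task when nothing matched
    return sorted(chosen) if chosen else 'no_script'

Whether the two selected tasks then truly run at the same time depends on the executor and parallelism settings in airflow.cfg; with the SequentialExecutor they would still run one after the other.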

How to use MySqlOperator with xcom in Airflow?

I read How to use airflow xcoms with MySqlOperator, and while it has a similar title, it doesn't really address my issue.
I have the following code:
def branch_func_is_new_records(**kwargs):
    ti = kwargs['ti']
    xcom = ti.xcom_pull(task_ids='query_get_max_order_id')
    string_to_print = 'Value in xcom is: {}'.format(xcom)
    logging.info(string_to_print)
    if int(xcom) > int(LAST_IMPORTED_ORDER_ID):
        return 'import_orders'
    else:
        return 'skip_operation'

query_get_max_order_id = 'SELECT COALESCE(max(orders_id),0) FROM warehouse.orders where orders_id>1 limit 10'

get_max_order_id = MySqlOperator(
    task_id='query_get_max_order_id',
    sql=query_get_max_order_id,
    mysql_conn_id=MyCon,
    xcom_push=True,
    dag=dag)

branch_op_is_new_records = BranchPythonOperator(
    task_id='branch_operation_is_new_records',
    provide_context=True,
    python_callable=branch_func_is_new_records,
    dag=dag)

get_max_order_id >> branch_op_is_new_records >> import_orders
branch_op_is_new_records >> skip_operation
The MySqlOperator returns a number, and based on that number the BranchPythonOperator chooses the next task. It is guaranteed that the MySqlOperator returns a value greater than 0.
My problem is that nothing is pushed to XCom by the MySqlOperator.
In the UI, when I go to XCom, I see nothing. The BranchPythonOperator obviously reads nothing, so my code fails.
Why doesn't the XCom work here?
The MySQL operator currently (Airflow 1.10.0 at the time of writing) doesn't support returning anything in XCom, so the fix for now is to write a small operator yourself. You can do this directly in your DAG file (untested, so there may be silly errors):
from airflow.operators.mysql_operator import MySqlOperator as BaseMySqlOperator
from airflow.hooks.mysql_hook import MySqlHook

class ReturningMySqlOperator(BaseMySqlOperator):
    def execute(self, context):
        self.log.info('Executing: %s', self.sql)
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id,
                         schema=self.database)
        return hook.get_first(
            self.sql,
            parameters=self.parameters)

def branch_func_is_new_records(**kwargs):
    ti = kwargs['ti']
    xcom = ti.xcom_pull(task_ids='query_get_max_order_id')
    string_to_print = 'Value in xcom is: {}'.format(xcom)
    logging.info(string_to_print)
    if str(xcom) == 'NewRecords':
        return 'import_orders'
    else:
        return 'skip_operation'

query_get_max_order_id = 'SELECT COALESCE(max(orders_id),0) FROM warehouse.orders where orders_id>1 limit 10'

get_max_order_id = ReturningMySqlOperator(
    task_id='query_get_max_order_id',
    sql=query_get_max_order_id,
    mysql_conn_id=MyCon,
    # xcom_push=True,
    dag=dag)

branch_op_is_new_records = BranchPythonOperator(
    task_id='branch_operation_is_new_records',
    provide_context=True,
    python_callable=branch_func_is_new_records,
    dag=dag)

get_max_order_id >> branch_op_is_new_records >> import_orders
branch_op_is_new_records >> skip_operation
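One detail worth checking on top of the code above (an assumption to verify, not part of the original answer): MySqlHook.get_first() returns a whole row rather than a bare scalar, so the value pulled from XCom is typically a one-element sequence, and the branch callable should unpack the first column before comparing it against LAST_IMPORTED_ORDER_ID. A minimal sketch:

# Sketch: logging and LAST_IMPORTED_ORDER_ID come from the question's DAG file.
def branch_func_is_new_records(**kwargs):
    row = kwargs['ti'].xcom_pull(task_ids='query_get_max_order_id')
    # row looks like (123,) when ReturningMySqlOperator is used; COALESCE guards against NULL
    max_order_id = int(row[0]) if row else 0
    logging.info('Max order id from XCom: %s', max_order_id)
    if max_order_id > int(LAST_IMPORTED_ORDER_ID):
        return 'import_orders'
    return 'skip_operation'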

apache airflow - Cannot load the dag bag to handle failure

I have created an on_failure_callback function (referring to Airflow default on_failure_callback) to handle task failures.
It works well when there is only one task in a DAG; however, if there are 2 or more tasks, a task randomly fails because the operator is null (it can be resumed later manually). In airflow-scheduler.out the log is:
[2018-05-08 14:24:21,237] {models.py:1595} ERROR - Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally? NoneType
[2018-05-08 14:24:21,238] {jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for . Setting task to FAILED without callbacks or retries. Do you have enough resources?
The DAG code is:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
import airflow
from devops.util import WechatUtil
from devops.util import JiraUtil

def on_failure_callback(context):
    ti = context['task_instance']
    log_url = ti.log_url
    owner = ti.task.owner
    ti_str = str(context['task_instance'])
    wechat_msg = "%s - Owner:%s" % (ti_str, owner)
    WeChatUtil.notify_team(wechat_msg)
    jira_desc = "Please check log from url %s" % (log_url)
    JiraUtil.create_incident("DW", ti_str, jira_desc, owner)

args = {
    'queue': 'default',
    'start_date': airflow.utils.dates.days_ago(1),
    'retry_delay': timedelta(minutes=1),
    'on_failure_callback': on_failure_callback,
    'owner': 'user1',
}

dag = DAG(dag_id='test_dependence1', default_args=args, schedule_interval='10 16 * * *')

load_crm_goods = BashOperator(
    task_id='crm_goods_job',
    bash_command='date',
    dag=dag)

load_crm_memeber = BashOperator(
    task_id='crm_member_job',
    bash_command='date',
    dag=dag)

load_crm_order = BashOperator(
    task_id='crm_order_job',
    bash_command='date',
    dag=dag)

load_crm_eur_invt = BashOperator(
    task_id='crm_eur_invt_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis = BashOperator(
    task_id='crm_member_cohort_analysis_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis.set_upstream(load_crm_goods)
crm_member_cohort_analysis.set_upstream(load_crm_memeber)
crm_member_cohort_analysis.set_upstream(load_crm_order)
crm_member_cohort_analysis.set_upstream(load_crm_eur_invt)

crm_member_kpi_daily = BashOperator(
    task_id='crm_member_kpi_daily_job',
    bash_command='date',
    dag=dag)

crm_member_kpi_daily.set_upstream(crm_member_cohort_analysis)
I have tried updating airflow.cfg, raising the default memory from 512 to as much as 4096, but no luck. Would anyone have any advice?
I also tried updating my JiraUtil and WechatUtil as follows, encountering the same error.
WechatUtil:

import requests

class WechatUtil:
    @staticmethod
    def notify_trendy_user(user_ldap_id, message):
        return None

    @staticmethod
    def notify_bigdata_team(message):
        return None

JiraUtil:

import json
import requests

class JiraUtil:
    @staticmethod
    def execute_jql(jql):
        return None

    @staticmethod
    def create_incident(projectKey, summary, desc, assignee=None):
        return None
(I'm shooting tracer bullets a bit here, so bear with me if this answer doesn't get it right on the first try.)
The null operator issue with multiple task instances is weird... it would help with troubleshooting if you could boil the current code down to an MCVE, e.g., 1-2 operators, excluding the JiraUtil and WechatUtil parts if they're not related to the callback failure.
Here are 2 ideas:
1. Can you try changing the line that fetches the task instance out of the context to see if this makes a difference?
Before:

def on_failure_callback(context):
    ti = context['task_instance']
    ...

After:

def on_failure_callback(context):
    ti = context['ti']
    ...
I saw this usage in the Airflow repo (https://github.com/apache/incubator-airflow/blob/c1d583f91a0b4185f760a64acbeae86739479cdb/airflow/contrib/hooks/qubole_check_hook.py#L88). It's possible it can be accessed both ways.
2. Can you try adding provide_context=True on the operators either as a kwarg or in default_args?
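To make the MCVE suggestion concrete, here is a minimal sketch (assumed dag_id and task names, Airflow 1.x-style imports matching the question) with two trivial tasks, a forced failure, and a callback that only logs, so you can check whether the "Cannot load the dag bag" error still appears without the JiraUtil/WechatUtil calls:

# Sketch only: dag_id and task_ids are hypothetical; the callback uses idea 1 above
# (fetching the task instance via the 'ti' context key) and nothing else.
import logging
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def on_failure_callback(context):
    ti = context['ti']
    logging.error("Task failed: %s, log url: %s", str(ti), ti.log_url)


args = {
    'owner': 'user1',
    'start_date': airflow.utils.dates.days_ago(1),
    'retry_delay': timedelta(minutes=1),
    'on_failure_callback': on_failure_callback,
}

dag = DAG(dag_id='test_failure_callback_mcve', default_args=args,
          schedule_interval='10 16 * * *')

task_ok = BashOperator(task_id='task_ok', bash_command='date', dag=dag)
task_fail = BashOperator(task_id='task_fail', bash_command='exit 1', dag=dag)  # forced failure
task_fail.set_upstream(task_ok)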
