I'm running Dataflow jobs on GCP by scheduling DAGs with Cloud Composer.
This is my DAG graph. The maf_to_bq_X tasks are grouped together with an Airflow TaskGroup; they are BeamRunPythonPipelineOperator tasks that execute on Dataflow. These tasks all perform the same operations (a simplified sketch of one such task follows the list):
Get a file path in variables_conf
Read and process the file
Write the output to BigQuery
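For context, each maf_to_bq task is created by a small factory around BeamRunPythonPipelineOperator. Roughly like this (a simplified sketch, not my exact factory code; the py_file path and pipeline option names are illustrative):
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

def maf_to_bq_job(index, processing_path, maf_file, bq_data_table, dataset_name, temp_location,
                  staging_location, project_id, dataflow_region, mapping_file_path):
    # One Beam pipeline per MAF file, executed on Dataflow
    return BeamRunPythonPipelineOperator(
        task_id=f"maf_to_bq_{index}",
        runner="DataflowRunner",
        py_file="gs://dima_landing_area/beam/maf_to_bq.py",  # illustrative path
        pipeline_options={
            "input": maf_file,                                # file path from variables_conf
            "output_table": f"{dataset_name}.{bq_data_table}",
            "mapping_file": mapping_file_path,
            "temp_location": temp_location,
            "staging_location": staging_location,
        },
        dataflow_config=DataflowConfiguration(
            project_id=project_id,
            location=dataflow_region,
            wait_until_finished=True,  # the Airflow task stays running until the Dataflow job ends
        ),
    )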
I set max_active_runs=1 and max_active_tasks=3.
Every time I run the pipeline, one of those tasks runs twice.
After the set_variables_conf task, 3 tasks are taken from the queue and run in parallel; one of them is marked as up_for_retry even though it is still executing on Dataflow. Since there are no errors, the Dataflow job completes and the task is marked as successful, but Airflow then schedules the task a second time.
This is a big problem because, in the end, I will have duplicates in my BigQuery table.
What's the problem with this DAG?
PS. I've also tried to assign a priority_weight to each task, but it still doesn't work.
This is the DAG's code:
with models.DAG(
    "ieo_dima_extraction",
    default_args={
        "start_date": pendulum.today('Europe/Rome'),
        'retries': 3,
        "dataflow_default_options": {
            "project": project_id,
            "region": "europe-west1",
        }
    },
    on_success_callback=cleanup_xcom,
    is_paused_upon_creation=True,
    render_template_as_native_obj=True,
    max_active_runs=1,
    max_active_tasks=3,
    schedule_interval=None
) as dag:

    set_variables = PythonOperator(
        task_id='set_variables_conf',
        python_callable=set_variables_conf,
        op_kwargs={'bucket_name': 'dima_landing_area'},
        do_xcom_push=True
    )

    list_of_maf = Variable.get('dima_maf_paths', default_var={}, deserialize_json=True)
    mapping_file_path = Variable.get('dima_mapping_path', default_var='')
    dima_bucket = 'dima_landing_area'

    with TaskGroup(group_id='process_maf_files', prefix_group_id=False) as process_maf_files:
        if list_of_maf['files']:
            for index, maf_file in enumerate(list_of_maf['files']):
                maf_to_bq = maf_to_bq_job(index, processing_path, maf_file, bq_data_table, dataset_name, temp_location,
                                          staging_location, project_id, dataflow_region, mapping_file_path)

    truncate_map_tmp_start = truncate_job("start", dataset_name, bq_mapping_tmp_table, bq_job_region)

    with TaskGroup(group_id='process_mapping', prefix_group_id=False) as process_mapping:
        map_to_bq = mapping_to_bq(update_map_path, bq_mapping_tmp_table, dataset_name, sql_view_file, temp_location,
                                  staging_location, project_id, dataflow_region, mapping_file_path)
        merge_map = merge_job(query_merge, dataset_name, bq_mapping_table, bq_job_region)
        truncate_map_tmp_end = truncate_job("end", dataset_name, bq_mapping_tmp_table, bq_job_region)
        update_view = create_view_job(sql_view_script, dataset_name, bq_job_region)
        chain(map_to_bq, merge_map, truncate_map_tmp_end, update_view)

    chain(set_variables, truncate_map_tmp_start, [process_maf_files, process_mapping])
Related
I'm using Apache Airflow (2.3.1) to load data into a database. I have more than 150 DAGs and I need to run some of them first; how can I do this?
All the DAGs kick off at 3 am and then start to run in a random order, standing in a queue.
I read about priority_weight and weight_rule, but these only apply to tasks, not to a DAG as a whole.
As I said, the DAG queue is built randomly, and I would like to control it and hard-code which DAG should be executed first.
You can use an ExternalTaskSensor to define cross-DAG dependencies.
In particular it allows you to wait for an external (= on a different DAG) task or DAG to complete before proceeding. You can configure the dag_id and task_id to wait for and a time-delta for the execution_date (by default, it expects that the external DAG run has the same execution date as the current).
Full details and possible configurations are available in the official documentation: Cross-DAG Dependencies.
Example usage
Task to be executed first
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id='first_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as first_dag:
    first_task = DummyOperator(task_id='first_task')
Task to be executed later
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id='second_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as second_dag:
    first_task_sensor = ExternalTaskSensor(
        task_id='first_task_sensor',
        external_dag_id='first_dag',
        external_task_id='first_task',
        timeout=600,
        allowed_states=['success'],
        failed_states=['failed', 'skipped'],
        mode='reschedule'
    )

    second_task = DummyOperator(task_id='second_task')

    first_task_sensor >> second_task
I have an Airflow DAG where a task_group with a loop inside generates two dynamic tasks. After the task_group I need to perform other actions. My problem is:
Inside the task_group I have a branching operator that validates whether the last task should run or not. If one of the two flows completes with success, I want to continue my process. For that I'm using the trigger_rule one_success. My code:
with DAG(
    dag_id='hello_world',
    schedule_interval=None,
    start_date=datetime(2022, 8, 25),
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
    concurrency=1,
) as dag:

    task_a = DummyOperator(task_id="task_a")

    with TaskGroup(group_id='task_group') as my_group:
        my_list = ['a', 'b']
        for i in my_list:
            task_b = PythonOperator(
                task_id="task_a_{}".format(i),
                python_callable=p_task_1)

            var_to_continue = check_status(i)
            is_running = ShortCircuitOperator(
                task_id="is_{}_running".format(i),
                python_callable=lambda x: x in [True],
                op_args=[var_to_continue])

            task_c = PythonOperator(
                task_id="task_c_{}".format(i),
                python_callable=p_task_2)

            task_b >> is_running >> task_c

    task_d = DummyOperator(task_id="task_d", trigger_rule=TriggerRule.ONE_SUCCESS)

    task_a >> my_group >> task_d
My problem is: if one of the iterations returns skipped, task_d is always skipped, even if one of the flows returns success.
Do you know how to resolve this?
Thanks!
After a deep search, I found the problem.
In fact, by default ShortCircuitOperator ignores the trigger rules of all downstream tasks: if its condition evaluates to False, it cuts the circuit, meaning it skips every downstream task (its direct downstream tasks, their downstream tasks, and so on).
In Airflow 2.3.0, in this PR, a new argument ignore_downstream_trigger_rules was added with default value True, which ignores the downstream trigger rules; you can disable that behavior by passing False.
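So, on Airflow 2.3.0 or newer, a sketch of the change applied to the is_running task from your code:
from airflow.operators.python import ShortCircuitOperator

is_running = ShortCircuitOperator(
    task_id="is_{}_running".format(i),
    python_callable=lambda x: x in [True],
    op_args=[var_to_continue],
    # when the condition is False, skip only the direct downstream tasks
    # and let task_d's one_success trigger rule be evaluated normally
    ignore_downstream_trigger_rules=False)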
If you are using a version older than 2.3.0, you should replace the ShortCircuitOperator with another solution, for example:
from airflow.exceptions import AirflowSkipException

def check_condition():
    if not condition:  # add your logic
        raise AirflowSkipException()

is_running = PythonOperator(..., python_callable=check_condition)
is_running >> task_c
I have two DAGs:
DAG_A , DAG_B.
DAG_A triggers DAG_B through TriggerDagRunOperator.
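(The trigger task in DAG_A is just a TriggerDagRunOperator pointing at DAG_B; a rough sketch, since DAG_A's code is not shown here:)
from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # Airflow 2.x import path

trigger_dag_b = TriggerDagRunOperator(
    task_id='trigger_dag_b',   # illustrative task_id
    trigger_dag_id='DAG_B')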
My tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:

    start = DummyOperator(
        task_id='start')

    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")

    end = DummyOperator(
        task_id='end')
        #trigger_rule='none_failed')

    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)

    for index, table in enumerate(iterable_list):
        table = table.strip()

        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )

        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )

        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )

        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )

        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After executing airflow db init and bringing up the webserver and scheduler, DAG_B fails, with the failure in the delete_xcom_task task.
[2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
But both DAGs succeed from the second run onwards.
Can anyone explain to me what is happening internally?
How can I avoid the failure during the first run?
Thanks.
I suspect that the problem is schedule_interval='@once' for DAG_B: when you add the DAG for the first time, the schedule_interval tells the scheduler to run the DAG once. So DAG_B is triggered once by the scheduler and not by DAG_A. Any preparations that need to be done by DAG_A for DAG_B to run successfully have not been done yet, therefore DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected. Both succeed.
To avoid DAG_B being triggered by the scheduler set schedule_interval=None.
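Applied to the DAG definition from the question, the change is just this (a sketch; everything else stays the same):
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval=None,  # DAG_B only runs when DAG_A triggers it
    description='ETL pipeline for processing users'
) as dag:
    ...  # tasks unchanged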
I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run at 10 minutes past every hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that fall while the previous run is still executing are run.
For example, say the job at 01:10 takes 10 hours to finish. The schedules at 02:10, 03:10, 04:10, ... 11:10, which are supposed to run, get skipped, and only the one at 12:10 is executed.
I am using the local executor. I am running the Airflow webserver & scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'reconciliation_job_pipeline'

MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'

DA_REST_API_CONNECTION_CONFIG = 'rest_api'

recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")

dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)

dag.doc_md = __doc__

spark_job_end_point = conf['sip_da']['spark_job_end_point']

fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']

fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)

job_endpoint = conf['sip_da']['job_resource_endpoint']

fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)

fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator & JobStatusSensor are my custom classes extending SimpleHttpOperator & HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem with that option is that sometimes the status check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False; more detail can be found here. Because of that, the Airflow scheduler will skip those missed executions, and you see the behavior you described.
It sounds like you do need catchup when the previous run takes longer than expected, so you can try changing it to catchup=True.
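Applied to the DAG definition from the question, that would be roughly:
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=True)  # missed runs are created and backfilled once the long run finishes
With max_active_runs=1 still in place, the backfilled runs then execute one at a time rather than all at once.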
I created a dag and scheduled it on a daily basis.
It gets queued every day but tasks don't actually run.
This problem was already raised in the past here, but the answers didn't help me, so it seems there is another problem.
My code is shared below. I replaced the SQL of task t2 with a comment.
Each one of the tasks runs successfully when I run them separately on the CLI using "airflow test ...".
Can you explain what should be done to make the DAG run?
Thanks!
This is the DAG code:
from datetime import timedelta, datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': True,
    'start_date': datetime(2018, 6, 25),
    'email': ['myemail@moovit.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('my_agg_table',
          default_args=default_args,
          schedule_interval="30 4 * * *"
          )

t1 = BigQueryOperator(
    task_id='bq_delete_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    delete `my_project.agg.my_agg_table`
    where date = '{{ macros.ds_add(ds, -1)}}'
    ''',
    dag=dag)

t2 = BigQueryOperator(
    task_id='bq_insert_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    bql='''
    #standardSQL
    Select ... the query continue here.....
    ''',
    destination_dataset_table='my_project.agg.my_agg_table',
    dag=dag)

t1 >> t2
It is usually very easy to find out why a task is not being run. When in the Airflow web UI:
select any DAG of interest
now click on the task
again, click on Task Instance Details
In the first row there is a panel Task Instance State
In the box Reason next to it is the reason why a task is being run - or why a task is being ignored
It usually makes sense to check the first task that is not being executed. I saw you have set depends_on_past=True, which can lead to problems if used in the wrong scenario.
More on that here: Airflow 1.9.0 is queuing but not launching tasks
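If depends_on_past turns out to be the blocker, a sketch of the change (assuming the daily runs do not actually need to depend on each other):
default_args = {
    'owner': 'me',
    'depends_on_past': False,  # do not block a run on the previous run's outcome
    'start_date': datetime(2018, 6, 25),
    'email': ['myemail@moovit.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}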