I have a DAG with schedule_interval None. I want to trigger this DAG with TriggerDagRunOperator multiple times a day.
I created a PreDag with schedule_interval "* 1/12 * * *".
Inside PreDag, a TriggerDagRunOperator task runs that triggers the main DAG.
As scheduled, PreDag runs twice a day. The first time PreDag runs it triggers the DAG, but the second time the TriggerDagRunOperator task fails with this error:
"A Dag Run already exists for dag id {{ dag_id }} at {{ execution_date }} with run id {{ trigger_run_id }}"
trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id='DW_Test_TriggerDag',
    pool='branch_pool_limit',
    wait_for_completion=True,
    poke_interval=20,
    trigger_run_id='trig__' + str(datetime.now()),
    execution_date='{{ ds }}',
    # reset_dag_run=True,
    dag=predag
)
Is it possible to trigger a DAG multiple times in a day using TriggerDagRunOperator?
Airflow uses dag_id and execution_date as the unique key of the dag run table, so when the DAG is triggered for the second time, a run with the same execution_date already exists from the first trigger.
Why do you have this problem? Because you are using {{ ds }} as the execution_date for the run:
The DAG run’s logical date as YYYY-MM-DD. Same as {{ dag_run.logical_date | ds }}.
which is only the date of your run, not the datetime, and the dates of two runs triggered on the same day are the same.
You can fix it by replacing {{ ds }} with {{ ts }}.
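For example, the trigger task from the question could look like this with that change applied (a minimal sketch; the Airflow 2 import path and the surrounding predag object are assumed from the question):

from datetime import datetime
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_run = TriggerDagRunOperator(
    task_id="main_dag_trigger",
    trigger_dag_id='DW_Test_TriggerDag',
    pool='branch_pool_limit',
    wait_for_completion=True,
    poke_interval=20,
    trigger_run_id='trig__' + str(datetime.now()),
    execution_date='{{ ts }}',  # full timestamp instead of '{{ ds }}', so two runs on the same day get distinct execution dates
    dag=predag,
)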
Does anyone know how to find out how a DAG run got started (whether it was started by the scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times when I run it manually to test something. I want to capture how the DAG run got started and pass that value to a column in a table where I'm saving some data. This will allow me to distinguish scheduled from manual starts and filter out test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type will be "manual", "scheduled" or "backfill" (not sure if there are others).
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
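So, for the original question (tagging saved rows with how the run was started), one option is to read dag_run.run_type inside the callable that writes the data. A minimal sketch, with the actual write to the table left as a placeholder since the target table isn't shown:

from airflow.models.dagrun import DagRun

def _save_data(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    row = {
        "started_by": dag_run.run_type,                    # "manual", "scheduled" or "backfill"
        "externally_triggered": dag_run.external_trigger,  # True if triggered outside the scheduler
        # ... the rest of the columns you are saving
    }
    # write `row` to your table here
    return row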
You can also use Jinja to access the default variables in templated fields; there is a variable representing the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}

def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")

dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)

with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )
I created a DAG that will run on a weekly basis. Below is what I tried and it's working as expected.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)

default_args = {
    'depends_on_past': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=2),
    'wait_for_downstream': True,
    'provide_context': True,
    'start_date': datetime(2020, 12, 20, hour=00, minute=00, second=00)
}

with DAG("DAG", default_args=default_args, schedule_interval=SCHEDULE_INTERVAL, catchup=True) as dag:
    t1 = BashOperator(
        task_id='dag_schedule',
        bash_command='echo DAG',
        dag=dag)
As per the schedule, it ran on the 27th (i.e. a start date of the 20th in the script). Because of a change in requirements, I then updated the start date to the 30th (i.e. the 23rd in the script) instead of the 27th; my idea is to start the schedule from the 30th and continue every week from there. After I changed the start date from the 27th to the 30th, the DAG did not pick up the latest start date, and I'm not sure why. When I deleted the DAG (it is a test DAG so I could delete it; in prod I can't) and created a new DAG with the same name and the latest start date, i.e. the 30th, it ran as per the schedule.
As per the Airflow docs:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
So if we change the start_date, we need to change the DAG name, or delete the existing DAG so that it is recreated with the same name (the metadata related to the previous DAG is then removed from the metadata database).
Source
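Following that convention, a minimal sketch of the change (the _v2 suffix is just the naming convention from the quoted docs; everything else reuses the code from the question):

default_args['start_date'] = datetime(2020, 12, 23)   # 23-Dec start + 1-week interval => first run on 30-Dec

with DAG("DAG_v2",                                    # new dag_id, because the start_date changed
         default_args=default_args,
         schedule_interval=SCHEDULE_INTERVAL,
         catchup=True) as dag:
    t1 = BashOperator(
        task_id='dag_schedule',
        bash_command='echo DAG',
        dag=dag)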
Your DAG, as you defined it, will be triggered on 6-Jan-2021.
Airflow schedules runs at the END of the interval (see the doc reference).
So per your settings:
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
and
'start_date': datetime(2020, 12, 30, hour=00, minute=00, second=00)
This means the first run will be on 6-Jan-2021, because 30-Dec-2020 + 1 week = 6-Jan-2021. Note that the execution_date of this run will be 2020-12-30.
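A quick sketch of that arithmetic (the variable names are just for illustration):

from datetime import datetime, timedelta

start_date = datetime(2020, 12, 30)
interval = timedelta(weeks=1)

first_run_created_at = start_date + interval   # 2021-01-06: the run is created when its interval ends
first_run_execution_date = start_date          # 2020-12-30: the run is stamped with the interval start
print(first_run_created_at, first_run_execution_date)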
I'm new to Airflow and trying to figure out how to pass the DAG run date to each task. I have this in my DAG:
from datetime import datetime
from dateutil import tz

_tzinfo = tz.gettz('America/Los_Angeles')
dag_run_date = datetime.now(_tzinfo)

dag = DAG(
    'myDag',
    default_args=default_args,
    schedule_interval=None,
    params={
        "runDateTimeTz": dag_run_date.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
    }
)
Then I try to pass the runDateTimeTz parameter to each of my tasks, something like this:
task1 = GKEPodOperator(
    image='gcr.io/myJar:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar.jar", "{{ params.runDateTimeTz }}"],
    dag=dag)

task2 = GKEPodOperator(
    image='gcr.io/myJar2:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar2.jar", "{{ params.runDateTimeTz }}"],
    dag=dag)
My tasks execute correctly, but I expected all of them to receive the same run date in params.runDateTimeTz. That didn't happen: for example, task1 gets params.runDateTimeTz=2020-04-16T07:42:47.412716-07:00 and task2 gets params.runDateTimeTz=2020-04-16T07:43:29.913289-07:00.
I suppose this behavior is related to the way Airflow fills the params for the DAG; it looks like params.runDateTimeTz is only evaluated when each task starts to run, but I want to get it beforehand and send it to each task as an argument so that all tasks get the same value.
Can someone tell me what I'm doing wrong?
You can use the execution_date or ds from Airflow Macros:
Details: https://airflow.apache.org/docs/stable/macros-ref#default-variables
task1 = GKEPodOperator(
    image='gcr.io/myJar:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar.jar", "{{ ds }}"],
    dag=dag)

task2 = GKEPodOperator(
    image='gcr.io/myJar2:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar2.jar", "{{ ds }}"],
    dag=dag)
If you need a timestamp, you can use {{ ts }}.
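For instance, if the jar needs the full timestamp rather than just the date, the same pattern applies (a sketch based on one of the tasks from the question):

task1 = GKEPodOperator(
    image='gcr.io/myJar:1.0.1.45',
    cmds=['java'],
    arguments=["-jar", "myJar.jar", "{{ ts }}"],   # the run's logical timestamp; identical for every task in the same run
    dag=dag)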
I'm new to Airflow and I'm trying to run an external DAG (developed and owned by another team) as part of my DAG flow.
I was looking at SubDagOperator, but it seems that for some reason it enforces the name of the subdag to be '{parent_dag_id}.{this_task_id}', which I cannot do as the child DAG is owned by a different team.
Here is my code sample:
parent_dag = DAG(
    dag_id='parent_dag', default_args=args,
    schedule_interval=None)

external_dag = SubDagOperator(
    subdag=another_teams_dag,
    task_id='external_dag',
    dag=parent_dag,
    trigger_rule=TriggerRule.ALL_DONE
)
and the other team's dag is defined like this:
another_teams_dag = DAG(
    dag_id='another_teams_dag', default_args=args,
    schedule_interval=None)
but I'm getting this error:
The subdag's dag_id should have the form
'{parent_dag_id}.{this_task_id}'. Expected 'parent_dag.external_dag';
received 'another_teams_dag'.
Any ideas?
What am I missing?
Use TriggerDagRunOperator
More info: https://airflow.apache.org/code.html#airflow.operators.dagrun_operator.TriggerDagRunOperator
Example:
Dag that triggers: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
Dag that is triggered: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
For your case, you can use something like:
trigger = TriggerDagRunOperator(
    task_id='external_dag',
    trigger_dag_id="another_teams_dag",
    dag=dag)
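Adapted to the parent DAG from the question, it might look like this (a sketch; the import path follows the linked docs and the trigger_rule is carried over from the original SubDagOperator task):

from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.utils.trigger_rule import TriggerRule

trigger = TriggerDagRunOperator(
    task_id='external_dag',
    trigger_dag_id='another_teams_dag',   # the other team's dag_id, unchanged
    trigger_rule=TriggerRule.ALL_DONE,    # kept from the SubDagOperator version
    dag=parent_dag)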
I created a DAG and scheduled it on a daily basis.
It gets queued every day, but the tasks don't actually run.
This problem has already been raised in the past here, but the answers didn't help me, so it seems there is another problem.
My code is shared below. I replaced the SQL of task t2 with a comment.
Each of the tasks runs successfully when I run it separately from the CLI using "airflow test ...".
Can you explain what should be done to make the DAG run?
Thanks!
This is the DAG code:
from datetime import timedelta, datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': True,
    'start_date': datetime(2018, 6, 25),
    'email': ['myemail@moovit.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('my_agg_table',
          default_args=default_args,
          schedule_interval="30 4 * * *"
          )

t1 = BigQueryOperator(
    task_id='bq_delete_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    delete `my_project.agg.my_agg_table`
    where date = '{{ macros.ds_add(ds, -1) }}'
    ''',
    dag=dag)

t2 = BigQueryOperator(
    task_id='bq_insert_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    bql='''
    #standardSQL
    Select ... the query continues here .....
    ''',
    destination_dataset_table='my_project.agg.my_agg_table',
    dag=dag)

t1 >> t2
It is usually very easy to find out the reason why a task is not being run. In the Airflow web UI:
select the DAG of interest
now click on the task
then click on Task Instance Details
in the first row there is a panel Task Instance State
in the box Reason next to it is the reason why a task is being run - or why it is being ignored
It usually makes sense to check the first task that is not being executed, since I see you have set depends_on_past=True, which can lead to problems if used in the wrong scenario.
More on that here: Airflow 1.9.0 is queuing but not launching tasks
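If depends_on_past does turn out to be the culprit here, one option is to relax it in default_args (a sketch of only the relevant change; whether that is acceptable depends on your pipeline):

default_args = {
    'owner': 'me',
    'depends_on_past': False,   # tasks no longer wait for the previous run's task to have succeeded
    'start_date': datetime(2018, 6, 25),
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}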