I am using GCP cloud composer and have an airflow instance and a DAG.
The DAG's default arguments are these:
# Define the default arguments for the DAG.
default_args = {
# If the start date is set to yesterday, Cloud Composer schedules the workflow to start immediately after the DAG
# uploads.
"start_date": datetime.datetime(2021, 12, 2) #yesterday
,"owner": "foobar"
,"schedule_interval": "*/5 * * * *"
,"tags": ["create_empty_dataset", "BigQuery"]
The DAG is running fine, but the schedule (every 5 minutes) is not being honored. I tried using datetime.timedelta(minutes=5) instead, but that didn't work either. The DAG just runs once and that's it; that single run is all I see.
Any ideas? I can edit and share the DAG code if necessary. It's simple.
Thanks!
Your issue is with how you set the parameters.
default_args is set at the DAG level, but it contains parameters that are passed down to operators. So when you set schedule_interval in default_args, that parameter is passed to every operator in your DAG, which has no effect because operators don't have a schedule_interval parameter.
You should set schedule_interval in the DAG constructor:
dag = DAG(..., schedule_interval="*/5 * * * *", default_args=default_args)
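For reference, a minimal sketch of the corrected setup; the dag_id below is an example (not taken from the question) and the imports assume Airflow 2.x as shipped with current Cloud Composer:

import datetime
from airflow import DAG

# Only operator-level parameters belong in default_args.
default_args = {
    "start_date": datetime.datetime(2021, 12, 2),
    "owner": "foobar",
}

# schedule_interval and tags are DAG-level parameters, so they go on the DAG itself.
dag = DAG(
    dag_id="create_empty_dataset",  # example name, not from the question
    default_args=default_args,
    schedule_interval="*/5 * * * *",
    tags=["create_empty_dataset", "BigQuery"],
)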
Related
Does Airflow automatically detect changes to Variables that are used by already-deployed DAGs and apply them immediately, or is a manual DAG restart or refresh required to pick up the new value of a changed Variable?
If the Airflow Variable is used in the DAG code, as in the example below, the Python variable interval will pick up the Airflow Variable's new value the next time the scheduler parses the file. The scheduler parses DAG files periodically, at a short interval.
# experimental_dag.py
import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'anonymous',
    'start_date': datetime.datetime(2021, 12, 1),
}

# Resolved when the scheduler parses this file; '@daily' is the fallback default.
interval = Variable.get('interval', '@daily')

dag = DAG(
    'experimental_dag',
    default_args=default_args,
    schedule_interval=interval,
)

def write_log(ts):
    with open('./output.txt', 'a') as f:
        f.write(f'{ts}\n')

py_task = PythonOperator(
    task_id='load_yesterday_data',
    python_callable=write_log,
    op_kwargs={
        'ts': '{{ ts }}'
    },
    dag=dag,
)
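To illustrate how the new value gets picked up, here is a hedged sketch of updating the Variable programmatically (you can equally do this in the UI under Admin > Variables); the key 'interval' matches the code above and the new cron value is just an example:

from airflow.models import Variable

# Must run in an environment that can reach the Airflow metadata database.
# On the next scheduler parse of experimental_dag.py, Variable.get('interval', ...)
# returns this new value and the DAG's schedule_interval changes accordingly.
Variable.set('interval', '*/30 * * * *')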
If the DAG code itself is changed (for example, the experimental_dag.py above), then the updated file definitely needs to be copied to the DAG folder (configured in airflow.cfg).
Airflow Variables are stored in the database. Airflow does not maintain DAG <-> Variable relationships. Variables are not bound to a specific DAG.
The value of a variable is populated when Variable.get() is called in your code.
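As an aside (an assumption about what you might want, not something from the question), if you need the value resolved at task run time rather than at parse time, you can reference the Variable through the template engine instead. A minimal sketch, assuming the Airflow 2.x import path and the dag object from the example above:

from airflow.operators.python import PythonOperator

# op_kwargs is a templated field, so {{ var.value.interval }} is read from the
# Variable when the task runs, not when the scheduler parses the file.
log_interval = PythonOperator(
    task_id='log_interval',
    python_callable=lambda interval: print(interval),
    op_kwargs={'interval': '{{ var.value.interval }}'},
    dag=dag,
)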
It seems there has been previous discussion about this.
How do I stop Airflow from running a task the first time when I unpause it?
https://groups.google.com/g/cloud-composer-discuss/c/JGtmAd7xcsM?pli=1
When I deploy a dag to run at a specific time (say, once a day at 9AM), Airflow immediately runs the dag at deployment.
dag = DAG(
'My Dag',
default_args=default_args,
schedule_interval='00 09 * * *',
start_date = datetime(2021, 1, 1),
catchup=False  # don't backfill previous runs; run only the latest
)
That's because with catchup=False, the scheduler "creates a DAG run only for the latest interval", as indicated in the docs.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
What I want to achieve is that I don't even want a DAG run for the latest interval to start. I want nothing to happen until the next time clock strikes 9AM.
It seems like out of the box, Airflow does not have any native solution to this problem.
What are some workarounds that people have been using? Perhaps something like checking whether the current time is close to next_execution_date?
When you update your DAG you can set start_date to the next day.
However, this won't work if you pause/unpause the DAG.
Note that start_date is recommended to be a static value (avoid datetime.now() or similar dynamic values), so for every deployment you need to specify a new value such as datetime(2021, 10, 15), datetime(2021, 10, 16), ..., which might make deployment more difficult (see the sketch below).
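A minimal sketch of that workaround, reusing the schedule from the question; the dag_id and the concrete date are example values you would bump on each deployment:

import datetime
from airflow import DAG

dag = DAG(
    'my_dag',  # example id; the question uses 'My Dag'
    schedule_interval='00 09 * * *',
    # With a start_date in the future, no schedule interval has closed yet at
    # deployment time, so the scheduler creates no run until the first interval
    # after start_date ends (here, 9AM on 2021-10-17).
    start_date=datetime.datetime(2021, 10, 16),
    catchup=False,
)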
With the DAG paused: create a DAG run via http://.../dagrun/add with Execution Date set to the one you need to skip. This makes the task instances accessible in the UI.
Mark those task instances as success in the UI.
Unpause the DAG.
I have the DAG:
dag = DAG(
dag_id='example_bash_operator',
default_args=args,
schedule_interval='0 0 * * *',
start_date=days_ago(2),
dagrun_timeout=timedelta(minutes=60),
tags=['example']
)
What is the significance of dag.cli()?
What role does cli() play?
if __name__ == "__main__":
dag.cli()
Today is the 14th of October. When I add catchup=False, it executes for the 13th of October. Shouldn't it execute only for the 14th? Without it, it executes for the 12th and the 13th, which makes sense as it would backfill. But with catchup=False, why does it execute for the 13th of October?
dag = DAG(
dag_id='example_bash_operator',
default_args=args,
schedule_interval='0 0 * * *',
start_date=days_ago(2),
catchup=False,
dagrun_timeout=timedelta(minutes=60),
tags=['example']
)
You should avoid setting the start_date to a relative value - this can lead to unexpected behaviour because the value is re-interpreted every time the DAG file is parsed.
There is a long description within the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
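For illustration, a minimal sketch of the DAG from the question with a static start_date instead of days_ago(2); the concrete date is an example value:

import datetime
from airflow import DAG

dag = DAG(
    dag_id='example_bash_operator',
    schedule_interval='0 0 * * *',
    start_date=datetime.datetime(2021, 10, 12),  # static, parsed identically on every scheduler pass
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=60),
    tags=['example'],
)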
Regarding dag.cli(), I would remove this whole part - it's definitely not required for the DAG to be executed by the Airflow scheduler, see this question.
Regarding catchup=False and why it executes for the 13th of October - have a look at the scheduler documentation:
The scheduler won’t trigger your tasks until the period it covers has ended, e.g., a job with schedule_interval set as @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the dag is executed. In the UI, it appears as if Airflow is running your tasks a day late
Note
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
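To make the timing concrete, here is a small worked sketch in plain Python for the daily schedule from the question (dates assume today is 2021-10-14, as in the question):

import datetime

today = datetime.datetime(2021, 10, 14)

# The most recent fully-closed interval for '0 0 * * *' is
# [2021-10-13 00:00, 2021-10-14 00:00), so with catchup=False the single run
# that gets created carries execution_date 2021-10-13 and is triggered just
# after midnight on the 14th.
interval_start = today - datetime.timedelta(days=1)
interval_end = today

print(f"execution_date: {interval_start:%Y-%m-%d}")  # 2021-10-13
print(f"triggered at:   {interval_end:%Y-%m-%d}")    # 2021-10-14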
Also the article Scheduling Tasks in Airflow might be worth a read.
New to Airflow, coming from cron, and trying to understand how the execution_date macro gets applied by the scheduler and on manual triggers. I've read the FAQ and set up a schedule that I expected would execute with the correct execution_date macro filled in.
I would like to run my DAG weekly, on Thursday at 10AM UTC. Occasionally I would run it manually. My understanding was that the DAG's start date should be one period behind the actual date I want the DAG to start. So, in order to execute the DAG today, on 4/9/2020, with a 4/9/2020 execution_date, I set up the following defaults:
default_args = {
'owner': 'airflow',
'start_date': dt.datetime(2020, 4, 2),
'concurrency': 4,
'retries': 0
}
And the dag is defined as:
with DAG('my_dag',
catchup=False,
default_args=default_args,
schedule_interval='0 10 * * 4',
max_active_runs=1,
concurrency=4,
) as dag:
opr_exc = BashOperator(task_id='execute_dag',bash_command='/path/to/script.sh --dt {{ ds_nodash }}')
While the DAG executed on time today, 4/9, it executed with a ds_nodash of 20200402 instead of 20200409. I guess I'm still confused: catchup was turned off and the start date was one week prior, so I was expecting 20200409.
Now, I found another answer here that basically explains that execution_date is at the start of the period and always one period behind. So going forward, should I be using next_ds_nodash? Wouldn't this create a problem for manually triggered DAGs, since execution_date works as expected when run on demand? Or does next_ds_nodash translate to ds_nodash when manually triggered?
Question: Is there a happy medium that allows me to correctly get the execution_date macro passed over to my weekly run dag when running scheduled AND when manually triggered? What's best practice here?
After a bit more research and testing, it does indeed appear that next_ds_nodash becomes equivalent to ds_nodash when manually triggering the dag.
Thus, if you are in a similar situation, do the following to correctly schedule your weekly job (with optional manual triggers):
Set the start_date one week prior to the date you actually want to start
Configure the schedule_interval accordingly for when you want to run the job
Use the next-execution-date macros (e.g. next_ds_nodash) wherever you expect the current execution date at the time the job runs (see the sketch below).
This works for me, but I don't have to deal with any catchup/backfill options, so YMMV.
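A minimal sketch of that last point, reusing the task from the question; it is meant to sit inside the with DAG(...) block shown above, and the import assumes the Airflow 2.x path:

from airflow.operators.bash import BashOperator

# {{ next_ds_nodash }} resolves to the end of the scheduled interval (the
# Thursday the job actually runs on) and, per the note above, to the same
# value as {{ ds_nodash }} on a manual trigger.
opr_exc = BashOperator(
    task_id='execute_dag',
    bash_command='/path/to/script.sh --dt {{ next_ds_nodash }}',
)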
I've been assessing Airflow the last few days as a possible replacement tool for our ETL workflows and found some interesting behaviour when a DAG is renamed in Airflow.
If I have a dag in a file called hello_world.py
dag = DAG('hello_world', description='Simple DAG',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 11, 1), catchup=True)
This DAG has been executing for 10 days in November. I then decide that I simply want to change the name of the DAG to 'yet_another_hello_world', e.g.
in the same file hello_world.py
dag = DAG('yet_another_hello_world', description='Simple DAG',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 11, 1), catchup=True)
I'm simply renaming the job, not changing the business logic etc. When this is deployed to Airflow, it is automatically picked up and registered as a new job, so there are now two jobs visible in the DAG view:
hello_world
yet_another_hello_world
Because of catchup=True in the DAG definition, the scheduler automatically sees this change, registers a new job yet_another_hello_world, and then backfills the missing executions from the 1st of November. It also leaves the existing hello_world job intact.
Ultimately, I want this to be a rename of the existing job and not preserve the old hello_world job. Is there a way to indicate to airflow that this is a simple rename?
As a best practice, it is always recommended to create a new DAG file (with a new dag_id) when you want to change a DAG's name, schedule_interval or start_date.