Airflow - Get start time of dag run

Is it possible to get the actual start time of a dag in Airflow? By start time I mean the exact time the first task of a dag starts running.
I know I can use macros to get the execution date. If the job is run using trigger_dag, this is what I would call a start time, but if the job is run on a daily schedule then {{ execution_date }} returns yesterday's date.
I have also tried placing datetime.now().isoformat() in the body of the dag code and then passing it to a task, but this seems to return the time the task is first called rather than when the dag itself started.

{{ dag_run.start_date }} provides the actual start time of the dag
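For illustration, a minimal sketch (the dag and task names are assumptions, not from the question) of using this macro in a templated field:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('example_dag', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    show_start = BashOperator(
        task_id='show_start',
        # dag_run.start_date is the wall-clock time this run actually started
        bash_command="echo 'run started at: {{ dag_run.start_date }}'",
    )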

This is an old question, but I am answering it because the accepted answer did not work for me. {{ dag_run.start_date }} changes if the DAG run fails and some tasks are retried.
The solution was to use {{ dag_run.get_task_instance('start').start_date }}, which uses the start date of the first task (a DummyOperator with task_id 'start').
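A rough sketch of that pattern (the dag id and the downstream task are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

with DAG('example_dag', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    start = DummyOperator(task_id='start')

    report = BashOperator(
        task_id='report',
        # the start date of the 'start' task instance is stable across retries of later tasks
        bash_command="echo 'dag started at: {{ dag_run.get_task_instance('start').start_date }}'",
    )

    start >> report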

I am going by the definition you stated:
By start time I mean the exact time the first task of a dag starts running
You can still do this with macros on your first task; try {{ task.start_date }}.
All the variables can be found in the TaskInstance class:
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L746

Related

Run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a dag where I run a few tasks, but I want to modify it so that the clean step only runs if another dag, "dag2", is not running at the moment.
Is there any way I can import information about "dag2", check its status, and proceed to the clean step if it is in a success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are different answers depending on what you want to do:
If you have two dags with the same schedule interval, and you want a run of the second dag to wait for the same run of the first one, you can use an ExternalTaskSensor pointed at the last task of the first dag (see the sketch below).
If you want to run dag2 after each run of dag1, even when dag1 is triggered manually, you need to update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second dag to None.
I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
If you have two dags and you don't want them running at the same time, to avoid a conflict on an external server/service, you can use one of the first two propositions, or give the tasks of the first dag a higher priority and put the conflicting tasks in the same pool (with 1 slot); but you will lose parallelism on those tasks.
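For the first proposition, a minimal sketch (the dag ids, the external task id, the daily schedule, and the placeholder operators are assumptions, not taken from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG('dag1', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    start = DummyOperator(task_id='start')
    clean = DummyOperator(task_id='clean')
    end = DummyOperator(task_id='end')

    # Waits for the same-interval run of dag2's final task to finish
    # (both dags are assumed to share the same schedule interval).
    wait_for_dag2 = ExternalTaskSensor(
        task_id='wait_for_dag2',
        external_dag_id='dag2',
        external_task_id='last_task_of_dag2',  # assumed task id
        mode='reschedule',  # free the worker slot while waiting
    )

    start >> wait_for_dag2 >> clean >> end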
Hossein's approach is the way people usually go. However, if you want to get info about any dag run, you can use Airflow's own models to get that info. The following approach is good when you do not want (or are not allowed) to modify the other dag:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
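One possible way (my assumption, not part of the answer above) to wire this check into the question's dag as the wait_for_dag2 step is a PythonSensor around the same lookup:

from airflow.models.dagrun import DagRun
from airflow.sensors.python import PythonSensor
from airflow.utils.state import DagRunState


def dag2_not_running():
    # True once there is no running DagRun of dag2, so the sensor succeeds
    return len(DagRun.find(dag_id='dag2', state=DagRunState.RUNNING)) == 0


wait_for_dag2 = PythonSensor(
    task_id='wait_for_dag2',
    python_callable=dag2_not_running,
    mode='reschedule',  # re-check periodically without holding a worker slot
    dag=dag,  # assumes the existing dag object from the question
)
# start >> wait_for_dag2 >> clean >> end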

Airflow - prevent dagrun from immediately running after deployment/unpause

It seems there has been previous discussion about this:
How do i stop airflow running a task the first time when i unpause it?
https://groups.google.com/g/cloud-composer-discuss/c/JGtmAd7xcsM?pli=1
When I deploy a dag to run at a specific time (say, once a day at 9AM), Airflow immediately runs the dag at deployment.
dag = DAG(
    'My Dag',
    default_args=default_args,
    schedule_interval='00 09 * * *',
    start_date=datetime(2021, 1, 1),
    catchup=False  # don't run previous and backfill; run only latest
)
That's because with catchup=False, the scheduler "creates a DAG run only for the latest interval", as indicated in the doc:
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
What I want to achieve is that I don't even want a DAG run for the latest interval to start; I want nothing to happen until the next time the clock strikes 9AM.
It seems like out of the box, Airflow does not have any native solution to this problem.
What are some workarounds that people have been using? Perhaps something like checking whether the current time is close to next_execution_date?
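For illustration, that guard could look something like the sketch below (the 30-minute tolerance and the task id are assumptions; a ShortCircuitOperator skips everything downstream when the callable returns False):

from datetime import timedelta

import pendulum
from airflow.operators.python import ShortCircuitOperator


def close_to_schedule(**context):
    # next_execution_date is the wall-clock moment this run was "meant" for;
    # the immediate run created right after deployment/unpause is far from it.
    return abs(pendulum.now('UTC') - context['next_execution_date']) < timedelta(minutes=30)


guard = ShortCircuitOperator(
    task_id='skip_unless_on_schedule',
    python_callable=close_to_schedule,
    dag=dag,  # assumes the dag defined above
)
# guard >> first_real_task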
When you update your dag you can set start_date to the next day.
However, it won't work if you pause/unpause the dag.
Note that start_date is recommended to be a static value (avoid datetime.now() or similar dynamic values), so for every deployment you would need to specify a new value like datetime(2021, 10, 15), datetime(2021, 10, 16), ..., which might make deployment more difficult.
With the dag paused, create a dag run via http://.../dagrun/add with Execution Date set to the one you need to skip. This makes the task instances accessible in the UI.
Mark those task instances as success in the UI.
Unpause the dag.

Understanding Airflow's execution_date and schedule

New to airflow, coming from cron, and trying to understand how the execution_date macro gets applied to the scheduling system and when manually triggered. I've read the FAQ and set up a schedule that I expected would execute with the correct execution_date macro filled in.
I would like to run my dag weekly, on Thursday at 10am UTC. Occasionally I would run it manually. My understanding was that the dag's start date should be one period behind the actual date I want the dag to start. So, in order to execute the dag today, on 4/9/2020, with a 4/9/2020 execution_date, I set up the following defaults:
default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2020, 4, 2),
    'concurrency': 4,
    'retries': 0
}
And the dag is defined as:
with DAG('my_dag',
         catchup=False,
         default_args=default_args,
         schedule_interval='0 10 * * 4',
         max_active_runs=1,
         concurrency=4,
         ) as dag:
    opr_exc = BashOperator(task_id='execute_dag',
                           bash_command='/path/to/script.sh --dt {{ ds_nodash }}')
While the dag executed on time today, 4/9, it executed with a ds_nodash of 20200402 instead of 20200409. I'm still confused: since catchup was turned off and the start date was one week prior, I was expecting 20200409.
Now, I found another answer here that basically explains that execution_date is the start of the period, and always one period behind. So going forward, should I be using next_ds_nodash? Wouldn't this create a problem for manually triggered dags, since execution_date works as expected when run on demand? Or does next_ds_nodash translate to ds_nodash when manually triggered?
Question: Is there a happy medium that allows me to correctly get the execution_date macro passed over to my weekly run dag when running scheduled AND when manually triggered? What's best practice here?
After a bit more research and testing, it does indeed appear that next_ds_nodash becomes equivalent to ds_nodash when manually triggering the dag.
Thus if you are in a similar situation, do the following to correctly schedule your weekly run job (with optional manual triggers):
Set the start_date one week prior to the date you actually want to start
Configure the schedule_interval accordingly for when you want to run the job
Use the next-execution-date macros (e.g. next_ds_nodash) wherever you expect to get what you would intuitively call the current execution date at the time the job runs.
This works for me, but I don't have to deal with any catchup/backfill options, so YMMV.
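Concretely, in the question's dag above, the templated command inside the same with DAG(...) block would become something like this (same placeholder script path as in the question):

    opr_exc = BashOperator(task_id='execute_dag',
                           bash_command='/path/to/script.sh --dt {{ next_ds_nodash }}')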

When does an airflow dag definition get evaluated?

Suppose I have an airflow dag file that creates a graph like so...
def get_current_info(filename):
    current_info = {}
    <fill in info in current_info relevant for today's date for given file>
    return current_info

files = [
    get_current_info("file_001"),
    get_current_info("file_002"),
    ....
]

for f in files:
    <some BashOperator bo1 using f's current info dict>
    <some BashOperator bo2 using f's current info dict>
    ....
    bo1 >> bo2
    ....
Since the values in the current_info dict used to define the dag change periodically (here, daily), I would like to know by what process/schedule the dag definition gets updated. (I print the current_info values each run and the values appear to be updating, but I'm curious how and when exactly this happens.)
Is "when does an airflow dag definition get evaluated" referenced anywhere in the docs?
The DAGs are evaluated in every run of the scheduler.
This article describes how the scheduler works and at what stage the DAG files are picked up for evaluation.
After some discussion on the airflow email list, it turns out that airflow builds the dag for each task when it is run, so each task includes the overhead of building the dag again (which in my case was very significant).
See more details on this here: https://stackoverflow.com/a/59995882/8236733
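As a small illustration of parse-time versus run-time evaluation (an assumed example, not from the thread): top-level code runs every time the scheduler or a worker parses the file, while the callable body runs only when a task instance executes.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

parse_time = datetime.utcnow()  # evaluated at every DAG file parse
print(f'DAG file parsed at {parse_time}')


def run_time_work():
    # evaluated only when the task instance actually runs
    print(f'task executed at {datetime.utcnow()}')


with DAG('parse_vs_run_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    PythonOperator(task_id='show_times', python_callable=run_time_work)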

Airflow depends_on_past for whole DAG

Is there a way in airflow of using the depends_on_past for an entire DagRun, not just applied to a Task?
I have a daily DAG, and the Friday DagRun errored on the 4th task; however, the Saturday and Sunday DagRuns still ran as scheduled. Using depends_on_past=True would have paused those DagRuns at the same 4th task, but the first 3 tasks would still have run.
I can see in the DagRun DB table there is a state column that contains failed for the Friday DagRun. What I want is a way of configuring a DagRun not to start at all if the previous DagRun failed, rather than starting and running until it hits a task that previously failed.
Does anyone know if this is possible?
On your first task, set depends_on_past=True and wait_for_downstream=True; the combination means the current dag run starts only if the last run succeeded.
That is because the first task of the current dag run waits for its own previous instance (depends_on_past) and for the tasks immediately downstream of that instance (wait_for_downstream) to succeed.
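A minimal sketch of that suggestion (the dag id, schedule, and task names are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('daily_dag', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    first_task = DummyOperator(
        task_id='first_task',
        depends_on_past=True,      # wait for this task's instance in the previous DagRun
        wait_for_downstream=True,  # and for its immediate downstream tasks in that run
    )
    # first_task >> rest_of_the_dag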
This question is a bit old, but it turns up as the first Google search result and the highest-rated answer is misleading (it made me struggle a bit), so it definitely deserves a proper answer. Although the second-rated answer should work, there's a cleaner way to do this, and I personally find using xcom ugly.
Airflow has a special operator class designed for monitoring the status of tasks from other dag runs, or of other dags as a whole. So what we need to do is add a task preceding all the tasks in our dag that checks whether the previous run succeeded.
from airflow.sensors.external_task_sensor import ExternalTaskSensor

previous_dag_run_sensor = ExternalTaskSensor(
    task_id='previous_dag_run_sensor',
    dag=our_dag,
    external_dag_id=our_dag.dag_id,
    execution_delta=our_dag.schedule_interval  # assumes schedule_interval is a timedelta
)
previous_dag_run_sensor.set_downstream(vertices_of_indegree_zero_from_our_dag)
One possible solution would be to use xcom:
Add 2 PythonOperators start_task and end_task to the DAG.
Make all other tasks depend on start_task
Make end_task depend on all other tasks (set_upstream).
end_task will always push a variable last_success = context['execution_date'] to xcom (xcom_push). (Requires provide_context = True in the PythonOperators).
And start_task will always check xcom (xcom_pull) to see whether there exists a last_success variable with value equal to the previous DagRun's execution_date or to the DAG's start_date (to let the process start).
Example use of xcom:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_xcom.py
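A rough Airflow 2-style sketch of those steps (the task ids, the XCom key, and the skip behaviour are my assumptions; provide_context is no longer needed in Airflow 2, and the very first run, the "start_date" case mentioned above, would need special-casing):

from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def push_last_success(**context):
    # end_task: record this run's execution_date once everything has succeeded
    context['ti'].xcom_push(key='last_success', value=context['execution_date'].isoformat())


def check_last_success(**context):
    # start_task: compare what end_task pushed on earlier runs with the
    # previous DagRun's execution_date
    last_success = context['ti'].xcom_pull(
        task_ids='end_task', key='last_success', include_prior_dates=True
    )
    prev = context.get('prev_execution_date')
    if prev is not None and last_success != prev.isoformat():
        raise AirflowSkipException('previous DagRun did not finish successfully')


start_task = PythonOperator(task_id='start_task', python_callable=check_last_success, dag=dag)
end_task = PythonOperator(task_id='end_task', python_callable=push_last_success, dag=dag)
# start_task >> all_other_tasks >> end_task  (dag is assumed to already exist)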
Here is a solution that addresses Marc Lamberti's concern, namely that wait_for_downstream is not "recursive".
The solution entails "embedding" your original DAG in between two dummy tasks, a start_task and an end_task.
Such that:
The start_task precedes all your original initial tasks (i.e., no other task in your DAG can start until start_task has completed).
An end_task follows all your original ending tasks (i.e., all branches in your DAG converge on that dummy end_task).
start_task also directly precedes end_task.
These conditions are provided by the following code:
start_task >> [_all_your_initial_tasks_here_]
[_all_your_ending_tasks_here] >> end_task
start_task >> end_task
Additionally, start_task needs depends_on_past=True and wait_for_downstream=True.
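Putting that together as a sketch (the dag id, schedule, and the placeholder tasks task_a/task_b are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('gated_dag', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    start_task = DummyOperator(
        task_id='start_task',
        depends_on_past=True,
        wait_for_downstream=True,
    )
    end_task = DummyOperator(task_id='end_task')

    # original dag body (placeholders)
    task_a = DummyOperator(task_id='task_a')
    task_b = DummyOperator(task_id='task_b')

    start_task >> [task_a]   # start_task precedes all initial tasks
    task_a >> task_b
    [task_b] >> end_task     # all ending tasks converge on end_task
    start_task >> end_task   # start_task also directly precedes end_task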
