Airflow - Variables among tasks

How do I create a variable at the DAG level and pass it on to multiple tasks?
For example:
cluster_name = 'data-' + datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")
I have to use the above variable cluster_name in all tasks, but I see the value keeps changing. I do not want to use XCom. Please advise.

This value will change all the time because the DAG definition is being parsed repeatedly by the scheduler/webserver/workers, and datetime.now() will return different values every time it is parsed.
I highly recommend against using dynamic task names.
The date is already part of a task in the sense that the execution date is part of what makes each run of the task unique.
Each task instance can be identified by: dag_id + task_id + execution_date
To uniquely identify the tasks, use these things instead of bundling the date inside the name.

You can store it in an Airflow Variable and it should be accessible to all your tasks. Just note that it is a database call each time you look up a Variable.
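For example, a minimal sketch assuming Airflow 2 import paths (the create_cluster / use_cluster task names and the cluster_name Variable key are illustrative): generate the name once at run time inside a setup task, store it with Variable.set, and read it back with Variable.get in the downstream tasks, so every task in a run sees the same value rather than a fresh value per parse.
from datetime import datetime

from airflow.models import Variable
from airflow.operators.python import PythonOperator

def _create_cluster(**context):
    # Compute the name once, at run time rather than at parse time,
    # and persist it so every downstream task sees the same value.
    cluster_name = "data-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")
    Variable.set("cluster_name", cluster_name)

def _use_cluster(**context):
    # Each Variable.get is a database call, so fetch it once per task.
    cluster_name = Variable.get("cluster_name")
    print(f"working on {cluster_name}")

with dag:  # inside your existing DAG definition
    create_cluster = PythonOperator(task_id="create_cluster", python_callable=_create_cluster)
    use_cluster = PythonOperator(task_id="use_cluster", python_callable=_use_cluster)
    create_cluster >> use_cluster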

Passing Trigger DAG Id value xcom

I am building a parent DAG that conditionally triggers another DAG depending on an XCom value.
Say I have the following DAGs in my system:
Load Contracts
Load Budget
Load Sales
Trigger Data Loads (Master)
In my master DAG, I have two tasks: (1) Check File Name, (2) Trigger the appropriate DAG based on the file name. If the file name is Sales, then trigger 'Load Sales', and likewise for Budget / Contracts.
Here is how my TriggerDagRunOperator is configured:
def _check_file_version(**context):
    context["task_instance"].xcom_push(key="file_name", value="Load Sales")

with dag:
    completion_file_check = PythonOperator(
        task_id="completion_file_check",
        python_callable=_check_file_version,
        provide_context=True
    )
    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        provide_context=True,
        trigger_dag_id={{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }},
        wait_for_completion=True
    )
I want to modify the trigger_dag_id value based on the filename. Is there a way to pull an XCom value into it? Looking at this link - DAG Dependencies - I see that this value is Jinja-templated, i.e. it can be modified using variables. However, my use case is to have it configured via xcom_pull. Can it be done?
When I put in the xcom_pull as suggested by the replies, I get a syntax error as shown below.
You can use ti.xcom_pull to get the XCom value that you pushed in a previous task.
{{ ti.xcom_pull(task_ids='sales_taskid', key='filename_key') }}
XCom is identified by key, task_id and dag_id. For your requirement, you can use xcom_pull, where you can provide task ids and keys. XCom is meant for small amounts of data; large values are not allowed. If a task auto-pushes its result into the XCom key called return_value, then xcom_pull will use that as the default key. For more information, you can check this link.
If you are auto-pushing the results into the default key, you can use the code below:
value = task_instance.xcom_pull(task_ids='pushing_task')
For the trigger_dag_id value based on the filename, you can use it in a template as below:
{{ task_instance.xcom_pull(task_ids='your_task_id', key='file_name') }}
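For example, applied to the operator from the question (a sketch, assuming a recent Airflow 2 release where trigger_dag_id is a templated field of TriggerDagRunOperator), the whole Jinja expression is passed as a string; provide_context is not needed here:
trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    # The template is rendered at run time, after completion_file_check has pushed its XCom.
    trigger_dag_id="{{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }}",
    wait_for_completion=True,
)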

DAG backfilling for many dates

I have a DAG that I need to backfill for many dates. Due to limited resources, I think the best option is to create a list of dates between a start and end date and run the task in a for loop.
Should I have a function that returns all the dates (after formatting) to a variable and run the DAG task in a for loop, or should the list of dates be part of a function that runs as a task and then somehow uses XCom to send the list of dates? How can this be assigned to a variable with xcom_pull without requiring a task?
Let me focus on the first part of the question, and not on the way you are suggesting to solve it:
I have a DAG that I need to backfill for many dates.
But you don't want to run all the dates at the same time because of limited resources!
I would look into utilizing the max_active_runs_per_dag configuration variable that Airflow provides out of the box. You don't need to create any additional logic.
This way you can limit the backfill process to a certain number of DAG runs in parallel. For example, if I want to only have 2 DAG runs at a time:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}
Run with --dry-run if you want to get a feel of what's happening before you execute the actual backfill command.
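Note that the command above uses the Airflow 1.10 CLI; on Airflow 2 the backfill command is namespaced under dags, so the equivalent would presumably be:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow dags backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}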
Hope that helps.
1) You don't want your DAG to exhaust limited resources:
Limit your DAG with max_active_runs or max_active_tasks:
DAG(
    "MyDAG",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["benchmark"],
    max_active_runs=1,
    max_active_tasks=1
)
Limit your resource utilisation using pools:
MyOperator(
    task_id="MyTask",
    ...
    pool="MyPool"
)
In Admin -> Pools, set the capacity to 1 or more.
2) You want to programmatically control your DAG:
Create a second DAG.
In this new DAG add a TriggerDagRunOperator
When using the TriggerDagRunOperator, use the conf argument to carry information to your main DAG (in your case, you would carry dates), as sketched below.
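A sketch of that controller pattern, assuming Airflow 2 import paths and a surrounding with DAG(...) block; the dates and the MyDAG id are illustrative, and the triggers are chained so only one date runs at a time:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

previous = None
for backfill_date in ["2021-01-01", "2021-01-02", "2021-01-03"]:   # illustrative dates
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_main_dag_{backfill_date}",
        trigger_dag_id="MyDAG",                                    # the DAG being backfilled
        conf={"backfill_date": backfill_date},                     # read via {{ dag_run.conf['backfill_date'] }} in the main DAG
        wait_for_completion=True,
    )
    if previous:
        previous >> trigger                                        # run one date at a time
    previous = trigger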

Create dynamic tasks depending on the result of an sql query in airflow

I am trying to create dynamic tasks with TaskGroup, saving the result in a Variable. The Variable is modified every N minutes depending on a database query, but when the Variable is modified the second time the scheduler breaks down.
Basically, I need to create tasks based on the number of unique rows returned by the query.
with TaskGroup(f"task") as task:
    data_variable = Variable.get("df")
    data = data_variable
    try:
        if data != False and data != 'none':
            df = pd.read_json(data)
            for field_id in df.field.unique():
                task1 = PythonOperator(
                )
                task2 = PythonOperator(
                )
                task1 >> task2
    except:
        pass
Is there a way to do this with TaskGroup?
For Airflow >=2.3.0:
Support for dynamic task creation has been added in AIP-42 Dynamic Task Mapping
You can read about this in the docs.
In simple words, it adds a map index to tasks, so a task can expand into a different number of mapped instances on every run, as sketched below.
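A minimal sketch of Dynamic Task Mapping with the TaskFlow API (assuming Airflow >= 2.3; the DAG and task names are illustrative), where expand() creates one mapped task instance per element returned at run time:
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def dynamic_fields():

    @task
    def get_field_ids():
        # stand-in for the unique rows returned by the database query
        return [1, 2, 3]

    @task
    def process_field(field_id):
        print(f"processing field {field_id}")

    # one mapped task instance is created per element of the returned list
    process_field.expand(field_id=get_field_ids())

dynamic_fields()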
For Airflow <2.3.0:
This is not supported.
While you can use Variable.get("df") in top-level code, you shouldn't do that. Variables / Connections / any other code that issues a query against a database should run only inside an operator's scope or via Jinja templating. The reason is that Airflow parses the DAG file periodically (every 30 seconds if you didn't change the default min_file_process_interval), so code that hits the database at parse time puts heavy load on that database.
For some of these cases there will be a warning in future Airflow versions (see PR).
Airflow tasks should be as static as possible (or slowly changing).

Multiple Airflow XCOMs

I am pushing multiple values to XCOM based on values returned from a database. As the number of values returned may vary, I am using the index as the key.
How do I, in the next task, retrieve all the values from the previous task? Currently I am only returning the last XCom from t1, but I would like all of them.
Here is the source code for xcom_pull.
You'll see it has some filter logic and defaulting behaviour. I believe you are doing xcom_pull()[-1] or equivalent. But you can use the task_ids argument to provide a list, in order, of the explicit task_ids that you want to pull XCom data from. Alternatively, you can use the keys that you pushed the data up with.
So in your case, where you want all the data emitted from the last task_instance and that alone, you just need to pass the task_id of the relevant task to the xcom_pull method.
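For example, if t1 pushed its values under the keys "0", "1", "2", ..., the downstream callable could loop over them with xcom_pull against that task_id (a sketch; adjust the key scheme to whatever you used when pushing):
def _pull_all_from_t1(**context):
    ti = context["task_instance"]
    values = []
    i = 0
    while True:
        value = ti.xcom_pull(task_ids="t1", key=str(i))
        if value is None:        # stop once there is no XCom for the next index
            break
        values.append(value)
        i += 1
    print(values)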

How to write/read time stamp from a variable in airflow?

I'm working with: EXEC_DATE = '{{ macros.ds_add(ds, 1) }}'. This gives me the execution date, but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into a variable called process_last_run.
Basically, I want to read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates what the running time of the last DAG run was.
How can I do that?
You can do this with the macro execution_date. However, be advised that this is a poorly named concept in Airflow. It represents the beginning of a scheduled interval period. It will not change within the same dag-run, even if the task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current time, which isn't conducive to idempotent processing, though you could throw away data after a specified cut-off.
So instead you might just take the date after your processing of data has completed, and store that for later. You can store it in an Airflow Variable. Note, though, that the time you get out of the date command shown below is going to be later than the timestamp of the last data you actually got from your process_data API call for all data from a start date. So it might be better if your processing step outputs the actual last date and time of the data processed as the last line of stdout (which is captured by the BashOperator for XCom).
E.g.:
from airflow.models import Variable, DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun safe.
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your api is always giving you the up-to-the-minute data.
    # So maybe getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))

with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{ var.value.process_last_run }}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop
# Because the last output line of a BashOperator is pushed into xcom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a variable.
There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you wanted to compute the running time of the entire DAG, you could get that from querying the Airflow metadata database around these fields for a particular DAG run.
If you're doing this in your Python code already, you can access the execution_date field on the task instance itself as well instead of using the template layer.
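For example, inside a callable that receives the context (a minimal sketch):
def my_callable(**context):
    # the same value the {{ execution_date }} template renders, as a datetime-like object
    exec_dt = context["execution_date"]
    print(exec_dt.strftime("%Y-%m-%d %H:%M"))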
Variables
You can write to and read from Airflow Variables like so:
Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
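For example, with the Airflow 2 CLI (the 1.10 CLI used flags such as airflow variables -s / -g instead of subcommands):
airflow variables set my_key "some value"
airflow variables get my_key
airflow variables list
airflow variables delete my_key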
Stats
Another thing you might keep in mind if you find yourself working with stats like task duration a lot is Airflow's StatsD integration which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus / Grafana by using statsd_exporter.
