How to write/read time stamp from a variable in airflow? - airflow

I'm working with: EXEC_DATE = '{{ macros.ds_add(ds, 1) }}' This gives me the execution date but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into a variable called process_last_run.
Basically, read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates the running time of the last DAG run.
How can I do that?

You can do this with the macro execution_date. Be advised, however, that this is a poorly named concept in Airflow: it represents the beginning of the scheduled interval period, and it will not change within the same DAG run, even if a task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current moment, which isn't conducive to idempotent processing, though you could throw away data after a specified cutoff.
So instead you might just take the date after your data processing has completed and store that for later, for example in an Airflow Variable. Note, though, that the time you get from the date command shown below will be later than the timestamp of the last record your process_data API call actually returned. So it might be better if your processing step outputs the actual date and time of the last data processed as the final line of stdout (which is captured by the BashOperator for XCom).
E.g.:
from airflow.models import Variable, DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun-safe.
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your api is always giving you the up-to-the-minute data.
    # So maybe getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))

with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{ var.value.process_last_run }}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop

# Because the last output line of a BashOperator is pushed into xcom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a Variable.

There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you wanted to compute the running time of the entire DAG, you could get that from querying the Airflow metadata database around these fields for a particular DAG run.
If you're doing this in your Python code already, you can access the execution_date field on the task instance itself as well instead of using the template layer.
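For example, a minimal sketch of computing the running time of an entire DAG run from the metadata database, assuming Airflow 1.10.x where start_date and end_date are plain columns on the TaskInstance table (the helper name is illustrative):
from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def dag_run_duration(dag_id, execution_date, session=None):
    # Collect all task instances of this DAG run and measure the span
    # between the earliest start and the latest end.
    tis = (session.query(TaskInstance)
                  .filter(TaskInstance.dag_id == dag_id,
                          TaskInstance.execution_date == execution_date)
                  .all())
    starts = [ti.start_date for ti in tis if ti.start_date]
    ends = [ti.end_date for ti in tis if ti.end_date]
    if not starts or not ends:
        return None
    return (max(ends) - min(starts)).total_seconds()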
Variables
You can write to and read from Airflow Variables like so:
Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
Stats
Another thing you might keep in mind if you find yourself working with stats like task duration a lot is Airflow's StatsD integration which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus / Grafana by using statsd_exporter.

Related

Passing Trigger DAG Id value xcom

I am building a parent DAG that basically does a conditional trigger of another DAG depending on an XCom value.
Say, I have the following dags in my system
Load Contracts
Load Budget
Load Sales
Trigger Data Loads (Master)
In my master DAG, I have two tasks: (1) Check File Name, (2) Trigger the appropriate DAG based on the file name. If the file name is Sales, then trigger 'Load Sales', and likewise for Budget / Contracts.
Here is how my TriggerDagRunOperator is configured:
def _check_file_version(**context):
    context["task_instance"].xcom_push(key="file_name", value="Load Sales")

with dag:
    completion_file_check = PythonOperator(
        task_id="completion_file_check",
        python_callable=_check_file_version,
        provide_context=True
    )
    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        provide_context=True,
        trigger_dag_id={{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }},
        wait_for_completion=True
    )
I want to modify the trigger_dag_id value based on the filename. Is there a way to pull the XCom value into it? Looking at this link - DAG Dependencies - I see that this value is Jinja templated, i.e. it can be modified using variables. However, my use case is to have it configured via xcom_pull. Can it be done?
When I put in the xcom_pull as suggested by the replies, I get a syntax error as shown below.
You can use ti.xcom_pull to get the XCom value that you pushed in a previous task:
{{ ti.xcom_pull(task_ids='sales_taskid', key='filename_key') }}
An XCom is identified by key, task_id and dag_id. For your requirement, you can use xcom_pull, where you provide the task id and key. XCom is meant for small amounts of data; larger values are not allowed. If a task auto-pushes its result into the XCom key called return_value, then xcom_pull will use that as the default key. For more information, you can check this link.
If you are auto-pushing the results into key, you can use below code :
value = task_instance.xcom_pull(task_ids='pushing_task')
For trigger_dag_id value based on the filename, you can use it in template as below :
{{ task_instance.xcom_pull(task_ids='your_task_id', key='file_name') }}
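Putting that together, a minimal sketch of the trigger task with the Jinja expression passed as a quoted string (trigger_dag_id is a templated field of TriggerDagRunOperator in Airflow 2.x; the task ids and key are the ones from the question):
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    # The whole Jinja expression must be a quoted string; it is rendered at runtime.
    trigger_dag_id="{{ task_instance.xcom_pull(task_ids='completion_file_check', key='file_name') }}",
    wait_for_completion=True,
)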

DAG backfilling for many dates

I have a DAG that I need to backfill for many dates. Due to limited resources, I think the best option is to create a list of dates between a start and end date and run the task in a for loop.
Should I have a function that returns all the dates (after formatting) to a variable and run the DAG task in a for loop, or should the list of dates be produced by a function that runs as a task and then somehow uses XCom to send the list of dates? How can this be assigned to a variable with xcom_pull without requiring a task?
Let me focus on the first part of the question, and not on the way you are suggesting to solve it:
I have a DAG that I need to backfill for many dates.
But you don't want to run all the dates at the same time because of limited resources!
I would look into utilizing the max_active_runs_per_dag configuration variable that airflow provides out of the box. You don't need to create any additional logic.
This way you can limit the backfill process to a certain number of DAG runs in parallel. For example, if I want to only have 2 dag runs at a time:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}
Run with --dry-run if you want to get a feel of what's happening before you execute the actual backfill command.
Hope that helps.
1) You don't want your DAG to exhaust limited resources:
Limit your dag with max_active_runs or max_active_tasks
DAG(
    "MyDAG",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["benchmark"],
    max_active_runs=1,
    max_active_tasks=1
)
Limit your resource utilisation using pools
MyOperator(
    task_id="MyTask",
    ...
    pool="MyPool"
)
In admin/pool set the capacity to 1 or more
2) You want to programmatically control your DAG:
Create a second DAG.
In this new DAG add a TriggerDagRunOperator
When using the TriggerDagRunOperator, use the conf argument to carry information to your main DAG (in your case, you would carry dates); see the sketch below.
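A minimal sketch of that controller pattern, assuming Airflow 2.x, a main DAG id of "MyDAG", and an illustrative list of dates (the conf key name is hypothetical):
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    "MyDAG_backfill_controller",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    previous = None
    # One trigger per date, chained so the main DAG runs for one date at a time.
    for date_str in ["2021-01-01", "2021-01-02", "2021-01-03"]:
        trigger = TriggerDagRunOperator(
            task_id=f"trigger_{date_str.replace('-', '_')}",
            trigger_dag_id="MyDAG",
            conf={"backfill_date": date_str},  # read in the main DAG via dag_run.conf
            wait_for_completion=True,
        )
        if previous:
            previous >> trigger
        previous = trigger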

Create dynamic tasks depending on the result of an sql query in airflow

I am trying to create dynamic tasks with TaskGroup, saving the result in a variable. The variable is modified every N minutes depending on a database query, but when the variable is modified the second time, the scheduler breaks down.
Basically I need to create tasks based on the number of unique rows that is received in the query.
with TaskGroup(f"task") as task:
    data_variable = Variable.get("df")
    data = data_variable
    try:
        if data != False and data != 'none':
            df = pd.read_json(data)
            for field_id in df.field.unique():
                task1 = PythonOperator(
                )
                task2 = PythonOperator(
                )
                task1 >> task2
    except:
        pass
Is there a way to do it with taskgroup for this?
For Airflow >=2.3.0:
Support for dynamic task creation has been added in AIP-42 Dynamic Task Mapping
You can read about this in the docs.
In simple words, it added a map index to tasks, so a task can expand into a different number of mapped instances in every run.
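A minimal sketch of that pattern with the TaskFlow API (the DAG id, task names, and the returned field list are illustrative; in your case the first task would run the database query at run time instead of reading a Variable at parse time):
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval="*/30 * * * *", catchup=False)
def dynamic_fields():

    @task
    def get_unique_fields():
        # Hypothetical: run your SQL query here and return one element per unique row.
        return ["field_a", "field_b", "field_c"]

    @task
    def process_field(field_id):
        print(f"processing {field_id}")

    # expand() creates one mapped task instance per element, resolved at run time.
    process_field.expand(field_id=get_unique_fields())

dynamic_fields()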
For Airflow <2.3.0:
This is not supported.
While you can use Variable.get("df") in top-level code, you shouldn't do that. Variables, Connections, or any other code that queries a database should only be used inside an operator's scope or via Jinja templating. The reason is that Airflow parses the DAG file periodically (every 30 seconds if you didn't change the default min_file_process_interval), so code that hits the database on every parse will put heavy load on that database.
For some of these cases there will be a warning in future airflow versions (see PR)
Airflow tasks should be as static as possible (or slowly changing).
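For example, instead of calling Variable.get("df") at module level, a sketch of deferring the lookup into the operator (or into a Jinja-templated field), using the variable name from the question:
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def process_df(**context):
    # The Variable is only read when the task runs,
    # not every time the scheduler parses the DAG file.
    data = Variable.get("df", default_var=None)
    print(data)

read_df = PythonOperator(
    task_id="read_df",
    python_callable=process_df,
)

# Or let Jinja resolve it at runtime inside a templated field:
# bash_command='echo "{{ var.value.df }}"'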

How can I author a DAG which has a dynamic set of tasks relative to the execution date?

We have a DAG which pulls in some data from an ad platform. These ads are organized into campaigns. Our goal is to pull in the high-level metrics for these campaigns. To do so we first need to get the list of active campaigns for the given execution date--fortunately the ad platform's API makes this trivial, provided we know the time range we'd like to inquire about.
Currently our DAG is structured to go and fetch these campaigns and then to store them in S3 and finally Redshift. We then query Redshift before setting up the subsequent tasks which pull the data for each campaign. This is the gross part. We could also look in S3, but the trouble is the keys are templated with the value of the ds macro. There doesn't seem to be a way to know that value when constructing the DAG itself.
Our current approach also isn't aware of the execution date so it always queries all campaigns even if those campaigns aren't active for the time period we're interested in.
To make this a little more concrete, here's what that DAG looks like today:
Another approach would be to roll this all up into a single operator that encapsulates getting the set of campaigns for the current execution date and then getting the metrics for each of those campaigns. We avoided this because that seems to preclude pulling the data in parallel via separate tasks per campaign.
How can we author this DAG such that we maintain the parallelization offered by dynamically querying the Redshift tables for campaigns but the campaigns are correctly constrained to the execution date?
I don't believe this is possible. The DAG can only render in one configuration defined by the DAG's python definition. You won't be able to control which version of the DAG renders as a function of execution date, so you won't be able to look back at how a DAG should render in the past, for instance. If you want the current DAG to render based on execution date then you can possibly write some logic in your DAG's python definition.
Depending on how you orchestrate your Airflow jobs, you may be able to have a single operator as you described, but have that single operator then kick off parallel queries on Redshift and terminate when all queries are complete.
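As a sketch of that second idea, a single PythonOperator callable could fan the per-campaign queries out over a thread pool and only finish once every query has completed (the connection details, table, and column names below are placeholders, not from the question):
from concurrent.futures import ThreadPoolExecutor

import psycopg2

def pull_campaign_metrics(campaign_ids, redshift_dsn):
    def run_one(campaign_id):
        # One connection per worker thread; Redshift runs the queries concurrently.
        with psycopg2.connect(redshift_dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM campaign_metrics WHERE campaign_id = %s",
                (campaign_id,),
            )
            return campaign_id, cur.fetchall()

    with ThreadPoolExecutor(max_workers=8) as pool:
        # The task only succeeds once all queries have returned.
        return dict(pool.map(run_one, campaign_ids))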
A caveat, in the interest of time I am going to piece together ideas and code examples from third party sources. I will give credit to those sources so you can take a look at context and documentation. An additional caveat, I have not been able to test this, but I am 99% certain this will work.
The tricky part of this whole operation will be figuring out how to handle your campaigns that might have ended and have started back up. Airflow is not going to like a DAG with a moving start or stop date. Moving the stop date might work a little better, moving the start date for the dag does not work at all. That said, if there are campaigns that get extended you should be able to move the end date as long as there are no gaps in continuity. If you have a campaign that lapses and then gets extended with a couple of non-active days in between you will probably want to figure out how to make those two look like unique campaigns to airflow.
First step
You will want to create a Python script that will call your database and return the relevant details from your campaigns. Assuming it is in MySQL, it will look something like this (an example connection from the PyMySQL pip package documentation):
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        # Create a new record
        sql = "INSERT INTO `users` (`email`, `password`) VALUES (%s, %s)"
        cursor.execute(sql, ('webmaster@python.org', 'very-secret'))
    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()
    with connection.cursor() as cursor:
        # Read a single record
        sql = "SELECT `id`, `password` FROM `users` WHERE `email`=%s"
        cursor.execute(sql, ('webmaster@python.org',))
        result = cursor.fetchall()
finally:
    connection.close()
Second step
You will want to iterate through that cursor and create your dags dynamically similar to this example from Astronomer.io:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag

# build a dag for each campaign returned by the query above
for n, campaign in enumerate(result):  # result is the pymysql result from above
    dag_id = 'hello_world_{}'.format(str(n))
    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)
                    }
    schedule = '@daily'
    dag_number = n
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
If you house all of this code in a single file, it will need to go in your dags folder. When a new campaign shows up in your database, you will create a DAG from it and can use your subdag architecture to run exactly the same set of steps/tasks with parameters pulled from that MySQL database. To be safe, and to keep recent campaigns in your DAG list, I would write the MySQL query with a date buffer. This way you still have DAGs that ended recently in your list. The day these DAGs end you should populate the end_date argument of the DAG.

Airflow - Variables among task

How do I create a variable at the DAG level and pass it on to multiple tasks?
For example:
cluster_name = 'data-' + datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")
I have to use the above variable cluster_name in all tasks, but I see the value keeps changing. I do not want to use XCom. Please advise.
This value will change all the time because the DAG definition is being parsed repeatedly by the scheduler/webserver/workers, and datetime.now() will return different values every time it is parsed.
I highly recommend against using dynamic task names.
The date is already part of a task in the sense that the execution date is part of what makes each run of the task unique.
Each task instance can be identified by: dag_id + task_id + execution_date
To uniquely identify the tasks, use these things instead of bundling the date inside the name.
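If the name only needs to be stable within a single DAG run, one common workaround (a sketch, assumed to live inside a with DAG(...) block) is to build it from the templated execution timestamp, so every task in the same run renders the same value:
from airflow.operators.bash import BashOperator

# Rendered per DAG run via Jinja; all tasks in the same run render the same name.
cluster_name = "data-{{ ts_nodash }}"

create_cluster = BashOperator(
    task_id="create_cluster",
    bash_command=f"echo creating cluster {cluster_name}",
)
use_cluster = BashOperator(
    task_id="use_cluster",
    bash_command=f"echo using cluster {cluster_name}",
)

create_cluster >> use_cluster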
You can store it in an Airflow Variable and it should be accessible to all your tasks. Just note that it is a database call each time you look up a Variable.
