execution_date in airflow: need to access as a variable - airflow

I am really a newbie in this forum. But I have been playing with airflow, for sometime, for our company. Sorry if this question sounds really dumb.
I am writing a pipeline using bunch of BashOperators.
Basically, for each Task, I want to simply call a REST api using 'curl'
This is what my pipeline looks like(very simplified version):
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from dateutil import tz
import datetime
datetime_obj = datetime.datetime
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.datetime.combine(datetime_obj.today() - datetime.timedelta(1), datetime_obj.min.time()),
'email': ['xxxx#xxx.xxx'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 2,
'retry_delay': datetime.timedelta(minutes=5),
}
current_datetime = datetime_obj.now(tz=tz.tzlocal())
dag = DAG(
'test_run', default_args=default_args, schedule_interval=datetime.timedelta(minutes=60))
curl_cmd='curl -XPOST "'+hostname+':8000/run?st='+current_datetime +'"'
t1 = BashOperator(
task_id='rest-api-1',
bash_command=curl_cmd,
dag=dag)
If you notice I am doing current_datetime= datetime_obj.now(tz=tz.tzlocal())
Instead what I want here is 'execution_date'
How do I use 'execution_date' directly and assign it to a variable in my python file?
I have having this general issue of accessing args.
Any help will be genuinely appreciated.
Thanks

The BashOperator's bash_command argument is a template. You can access execution_date in any template as a datetime object using the execution_date variable. In the template, you can use any jinja2 methods to manipulate it.
Using the following as your BashOperator bash_command string:
# pass in the first of the current month
some_command.sh {{ execution_date.replace(day=1) }}
# last day of previous month
some_command.sh {{ execution_date.replace(day=1) - macros.timedelta(days=1) }}
If you just want the string equivalent of the execution date, ds will return a datestamp (YYYY-MM-DD), ds_nodash returns same without dashes (YYYYMMDD), etc. More on macros is available in the Api Docs.
Your final operator would look like:
command = """curl -XPOST '%(hostname)s:8000/run?st={{ ds }}'""" % locals()
t1 = BashOperator( task_id='rest-api-1', bash_command=command, dag=dag)

The PythonOperator constructor takes a 'provide_context' parameter (see https://pythonhosted.org/airflow/code.html). If it's True, then it passes a number of parameters into the python_callable via kwargs. kwargs['execution_date'] is what you want, I believe.
Something like this:
def python_method(ds, **kwargs):
Variable.set('execution_date', kwargs['execution_date'])
return
doit = PythonOperator(
task_id='doit',
provide_context=True,
python_callable=python_method,
dag=dag)
I'm not sure how to do it with the BashOperator, but you might start with this issue: https://github.com/airbnb/airflow/issues/775

I think you can't assign variables with values from the airflow context outside of a task instance, they are only available at run-time. Basically there are 2 different steps when a dag is loaded and executed in airflow :
First your dag file is interpreted and parsed. It has to work and compile and the task definitions must be correct (no syntax error or anything). During this step, if you make function calls to fill some values, these functions won't be able to access airflow context (the execution date for example, even more if you're doing some backfilling).
The second step is the execution of the dag. It's only during this second step that the variables provided by airflow (execution_date, ds, etc...) are available as they are related to an execution of the dag.
So you can't initialize global variables using the Airflow context, however, Airflow gives you multiple mechanisms to achieve the same effect :
Using jinja template in your command (it can be in a string in the code or in a file, both will be processed). You have the list of available templates here : https://airflow.apache.org/macros.html#default-variables. Note that some functions are also available, particularly for computing days delta and date formatting.
Using a PythonOperator in which you pass the context (with the provide_context argument). This will allow you to access the same template with the syntax kwargs['<variable_name']. If you need so, you can return a value from a PythonOperator, this one will be stored in an XCOM variable you can use later in any template. Access to XCOM variables use this syntax : https://airflow.apache.org/concepts.html#xcoms
If you write your own operator, you can access airflow variables with the dict context.

def execute(self, context):
execution_date = context.get("execution_date")
This should be inside the execute() method of Operator

To print execution date inside the callable function of your PythonOperator you can use the following in your Airflow Script and also can add start_time and end_time as follows:
def python_func(**kwargs):
execution_date = kwargs["execution_date"] #<datetime> type with timezone
end_time = str(execution_date)
start_time = str(execution_date.add(minutes=-30))
I have converted the datetime value to string as I need to pass it in a SQL Query. We can use it otherwise also.

You may consider SimpleHttpOperator https://airflow.apache.org/_api/airflow/operators/http_operator/index.html#airflow.operators.http_operator.SimpleHttpOperator. It’s so simple for making http request. you can pass execution_date with endpoint parameter via template.

Here's another way without context. using the dag's last execution time can be very helpful in scheduled ETL jobs. Such as a dag that 'downloads all newly added files'. Instead of hardcoding a datetime.datetime, use the dag's last execution date as your time filter.
Airflow Dags actually have a class called DagRun that can be accessed like so: dag_runs = DagRun.find(dag_id=dag_id)
Here's an easy way to get the most recent run's execution time:
def get_most_recent_dag_run(dag_id):
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
return dag_runs[1] if len(dag_runs) > 1 else None
Then, within your pythonOperator, you can dynamically access the dag's last execution by calling the function you created above:
last_execution = get_most_recent_dag_run('dag')
Now its a variable!

Related

modify dag runs when triggered

I was curious if there's a way to customise the dag runs.
So I'm currently checking for updates for another table which gets updated manually by someone and once that's been updated, I would run my dag for the month.
At the moment I have created a branch operator that compares the dates of the 2 tables but is there a way to run the dag (compare the two dates) and run it everyday until there is a change and not run for the remaining of the month?
For example,
Table A (that is updated manually) has YYYYMM as 202209 and Table B also has YYYYMM as 202209.
Atm, my branch operator compares the two YYYYMM and would point to a dummy operator end when it's the same. However, when Table A has been updated to 202210, there's a difference in the two YYYYMM hence another task would run and overwrite Table B.
It all works but this would run the dag everyday even though the table A only gets updated once a month at a random point of time within the month. So is there way to trigger the dag to stop for the remaining days of the month after the task has been triggered?
Hope this is clear.
If you would be using data stored on S3 there would be easy solution starting from the version 2.4 - the Data-aware scheduling.
But probably you're not so there is another option.
A dag in Airflow is Dag object that is assigned to global scope. This allows for dynamic creation of dags. This implies each file is loaded on certain interval. A very good description with examples is here
Second thing you need to use is Airflow Variables
So the concept is as follows:
Create a variable in Airflow named dag_run that will hold the month when the dag has successfully run
Create a python file that has a function that creates a dag object based on input parameters.
In the same file use conditional statements that will set the 'schedule' param differently depending if the dag has run for current month
In your dag in the branch that executes when data has changed set the variable dag_run to the current months value like so: Variable.set(key='dag_run', value=datetime.now().month)
step 1:
python code:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from airflow.models import Variable
#function that creates dag based on input
def create_dag(dag_id,
schedule,
default_args):
def hello_world_py(*args):
print('Hello World')
print('This is DAG: {}'.format(str(dag_id)))
dag = DAG(dag_id,
schedule_interval=schedule,
default_args=default_args)
with dag:
t1 = PythonOperator(
task_id='hello_world',
python_callable=hello_world_py)
return dag
#run some checks
current_month = datetime.now().month
dag_run_month = int(Variable.get('run_month'))
if current_month == dag_run_month:
# keep the schedule off
schedule = None
dag_id = "Database_insync"
elif current_month != dag_run_month:
# keep the schedule on
schedule = "30 * * * *"
dag_id = "Database_notsynced"
#watch out for start_date if you leave
#it in the past airflow will execute past missing schedules
default_args = {'owner': 'airflow',
'start_date': datetime.datetime.now() - datetime.timedelta(minutes=15)
}
globals()[dag_id] = create_dag(dag_id,
schedule,
default_args)

Dynamic tasks in airflow based on an external file

I am reading list of elements from an external file and looping over elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This reading elements logic is not part of any task but in the DAG itself. Thus Scheduler is calling it many times a day while reading the DAG file. I want to call it only during DAG runtime.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json
def _read_file():
with open('dags/sample_file.json') as f:
data = json.load(f)
Variable.set(key='list_of_cities',
value=data['cities'], serialize_json=True)
print('Loading Variable from file...')
def _say_hello(city_name):
print('hello from ' + city_name)
with DAG('dynamic_tasks_from_var', schedule_interval='#once',
start_date=days_ago(2),
catchup=False) as dag:
read_file = PythonOperator(
task_id='read_file',
python_callable=_read_file
)
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
default_var=['default_city'],
deserialize_json=True)
print(f'Updated LIST: {updated_list}')
with TaskGroup('dynamic_tasks_group',
prefix_group_id=False,
) as dynamic_tasks_group:
for index, city in enumerate(updated_list):
say_hello = PythonOperator(
task_id=f'say_hello_from_{city}',
python_callable=_say_hello,
op_kwargs={'city_name': city}
)
# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the top-level code, hence read by the Scheduler continuously, is the call to Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly create connections to the metadata database (example in this article).
Update:
As for 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.

How to log inside Python function in PythonOperator in Airflow

I am using a PythonOperator in my Airflow DAG and I need to print something inside the operator's Python function. I tried to print but apparently it didn't work out. Not so sure that's going to work. Next I tried to pass self.log in PythonOperator but I am not sure how to pass that reference.
task = PythonOperator(
task_id='task1',
python_callable=my_func,
params={ ... },
provide_context=True,
dag=dag
)
...
def my_func(**context):
...
print(some_message) # this didn't work.
The way you should be logging in Airflow (and in Python in general) is through the logging module, that is,
import logging
on top of the DAG definition and
logging.info(some_message)
in place of the print statement in your my_func function. Other functions you can use besides info() (with different logging level/criticality) can be found in the Python docs: https://docs.python.org/3/howto/logging.html#logging-basic-tutorial

How can I retrieve the 'scheduled time' for catchup jobs in Airflow?

When building an Airflow dag, I typically specify a simple schedule to run periodically - I expect this is the most common use.
dag = DAG('my_dag',
description='this is what it does',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 10, 1),
catchup=False)
I then need to use the 'date' as a parameter in my actual process, so I just check the current date.
date = datetime.date.today()
# do some date-sensitive stuff
operator = MyOperator(..., params=[date, ...])
My understanding is that setting catchup=True will have Airflow schedule my dag for every schedule interval between start_date and now (or end_date); e.g. every day.
How do I get the scheduled_date for use within my dag instance?
I think you mean execution date here, You can use Macros inside your operators, more detail can be found here: https://airflow.apache.org/code.html#macros. So airflow will respect it so you don't need to have your date been generated dynamically
Inside of Operator, you can call {{ ds }} in a str directly
Outside of Operator, for example PythonOperator, you will need provide_context=True first then to pass **kwargs as last arguments to your function then you can call kwargs['ds']

Status of Airflow task within the dag

I need the status of the task like if it is running or upforretry or failed within the same dag. So i tried to get it using the below code, though i got no output...
Auto = PythonOperator(
task_id='test_sleep',
python_callable=execute_on_emr,
op_kwargs={'cmd':'python /home/hadoop/test/testsleep.py'},
dag=dag)
logger.info(Auto)
The intention is to kill certain running tasks once a particular task on airflow completes.
Question is how do i get the state of a task like is it in the running state or failed or success
I am doing something similar. I need to check for one task if the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the dag, it is not neccessary to specify the dag.
I simply created a function to loop through the past n_days and check the status.
def check_status(**kwargs):
last_n_days = 10
for n in range(0,last_n_days):
date = kwargs['execution_date']- timedelta(n)
ti = TaskInstance(*my_task*, date) #my_task is the task you defined within the DAG rather than the task_id (as in the example below: check_success_task rather than 'check_success_days_before')
state = ti.current_state()
if state != 'success':
raise ValueError('Not all previous tasks successfully completed.')
When you call the function make sure to set provide_context.
check_success_task = PythonOperator(
task_id='check_success_days_before',
python_callable= check_status,
provide_context=True,
dag=dag
)
UPDATE:
When you want to call a task from another dag, you need to call it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is also an api-call by now doing the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
dag = get_dag(args)
task = dag.get_task(task_id=args.task_id)
ti = TaskInstance(task, args.execution_date)
print(ti.current_state())
Hence, it seem you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively you could execute these CLI operations from within your code using python's subprocess library.
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm in postgres, but the structure should be the same. Start by grabbing the task_ids and state of the task you're interested in with a db call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
You can use the command line Interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer official airflow documentation:
http://airflow.incubator.apache.org/cli.html

Resources