How to pull an XCom variable from a previous run - Airflow

How do I pull an XCom variable from the previous run in Airflow? Is it possible?
I want to use the value pushed by the same task_id in the previous run_id as a Jinja variable for the data argument of a SimpleHttpOperator.
I'm looking in the macros docs https://airflow.apache.org/docs/stable/macros.html
and can't find any documented way to do this.
UPD
Example:
select_expired = SimpleHttpOperator(
    task_id='select_expired',
    http_conn_id='clickhouse_http',
    endpoint='/',
    method='POST',
    data=REQUESTED_EXPIRED_FLIGHTS,
    xcom_push=True,
    pool='clickhouse_select',
    dag=dag
)
where REQUESTED_EXPIRED_FLIGHTS is:
insert into table where column = '{{ ??????? (value returned in previous task) }}'

You should be able to access the previous task instance via previous_ti on the current task instance.
Then you can use current_state() to get its state, and perform actions based on that.
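One possible sketch for the XCom part: xcom_pull accepts include_prior_dates=True, which also considers pushes from earlier execution dates, so at template-render time of the current run it typically resolves to the previous run's value. Since data is a templated field of SimpleHttpOperator, REQUESTED_EXPIRED_FLIGHTS from the question could look something like this (this is an assumption-laden sketch, not a documented recipe):
REQUESTED_EXPIRED_FLIGHTS = (
    "insert into table where column = "
    "'{{ task_instance.xcom_pull(task_ids='select_expired', include_prior_dates=True) }}'"
)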

Related

I can't xcom_push arguments through BashOperator

I am new to Airflow's XCom feature. I tried it out with the PythonOperator and it was working fine (i.e., I can push and pull the value out of the context), but when I tried it out with the BashOperator, it didn't work. However, I can pull only the final stdout statement by adding the xcom_push=True attribute during task creation. That's one thing. 2) But I also wish to push and pull values based on their keys (to and from the BashOperator), like the way we do it in the PythonOperator. It would be really helpful, since I need to pass tons of variables from one script to another.
Is this what you want?
from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_bash_operator_1",
    schedule_interval=None,
    start_date=datetime(2018, 12, 31),
)

t1 = BashOperator(
    task_id="t1",
    bash_command='echo "{{ ti.xcom_push(key="k1", value="v1") }}" "{{ ti.xcom_push(key="k2", value="v2") }}"',
    dag=dag,
)

t2 = BashOperator(
    task_id="t2",
    bash_command='echo "{{ ti.xcom_pull(key="k1") }}" "{{ ti.xcom_pull(key="k2") }}"',
    dag=dag,
)

t1 >> t2
@SpaceyBot & Lucas answered your first question.
Regarding the second question raised:

"2) But I also wish to push and pull the values based on their keys (to and from the BashOperator), like the way we do it in the PythonOperator. It would be really helpful since I need to pass tons of variables from one script to another."

This is not advisable. All XCom push/pull actions are translated to insert/select statements against the Airflow metadata DB.
This will degrade scheduler performance over time and slow down the whole processing, either because of the high number of pull queries run or because of the large number of rows retrieved, which will be fetched through full table scans instead of index-based scans.
So it's better to consider a different mechanism here - storing the info in external json/csv/txt files, etc.
Bottom line - XCom is designed for transferring small amounts of data only, mostly counters and status variables.
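A minimal sketch of that file-based pattern, assuming a shared path reachable by all workers and PythonOperator callables run with provide_context=True (the path and task_id below are illustrative, not from the original answer):
import json

def produce(**kwargs):
    payload = {"var1": "v1", "var2": "v2"}  # the many values you want to hand over
    path = "/tmp/shared/{}.json".format(kwargs["ds"])  # hypothetical shared location
    with open(path, "w") as f:
        json.dump(payload, f)
    return path  # only this small path ends up in XCom

def consume(**kwargs):
    path = kwargs["ti"].xcom_pull(task_ids="produce_task")  # 'produce_task' is illustrative
    with open(path) as f:
        payload = json.load(f)
    return payload["var1"]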
In addition to @Ryan Yuan's answer, you can use the env parameter of the BashOperator to set environment variables for your bash script/command.
my_task = BashOperator(
    task_id='my_task',
    bash_command='echo $VAR1 $VAR2',
    env={
        "VAR1": '{{ ti.xcom_pull(key="var1") }}',
        "VAR2": '{{ ti.xcom_pull(key="var2") }}'
    },
    dag=dag
)

Airflow Get Retry Number

In my Airflow DAG I have a task that needs to know whether it's the first time it has run or whether it's a retry. I need to adjust my logic in the task if it's a retry attempt.
I have a few ideas on how I could store the number of retries for the task, but I'm not sure if any of them are legitimate or if there's an easier, built-in way to get this information within the task.
I'm wondering if I can just have an integer variable inside the dag that I increment every time the task runs. Then if the task is rerun I could check whether the value of the variable is greater than 1 and hence conclude it's a retry run. But I'm not sure if mutable global variables work that way in Airflow, since there can be multiple workers for different tasks (I'm not sure on this though).
Write it in an XCOM variable?
The retry number is available from the task instance, which is available via the macro {{ task_instance }}. https://airflow.apache.org/code.html#default-variables
If you are using the PythonOperator, simply add provide_context=True to your operator kwargs, and then in the callable do kwargs['task_instance'].try_number
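A minimal sketch of that PythonOperator variant (the task_id and callable name are illustrative, and an existing dag object is assumed):
from airflow.operators.python_operator import PythonOperator

def handle_retries(**kwargs):
    try_number = kwargs['task_instance'].try_number  # 1 on the first attempt
    if try_number > 1:
        # adjust your task logic for retry attempts here
        print("This is retry attempt number {}".format(try_number - 1))

check_retry = PythonOperator(
    task_id='check_retry',
    python_callable=handle_retries,
    provide_context=True,
    dag=dag
)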
Otherwise you can do something like:
t = BashOperator(
    task_id='try_number_test',
    bash_command='echo "{{ task_instance.try_number }}"',
    dag=dag)
Edit:
When the task instance is cleared, it will set max_tries to the current try_number plus the configured retries value. So you could do something like:
ti = ...  # whatever method you use to get the task_instance object
is_first = ti.max_tries - ti.task.retries + 1 == ti.try_number
Airflow increments try_number by 1 when running, so I imagine you'd need the + 1 when subtracting the configured retries value from max_tries. But I didn't test that to confirm.
@cwurtz's answer was spot on. I was able to use it like this:
def _get_actual_try_number(self, context):
    '''
    Returns the real try_number that you also see in task details or logs.
    '''
    return context['task_instance'].try_number

def _get_relative_try_number(self, context):
    '''
    When a task is cleared, the try_numbers continue to increment.
    This returns the try number relative to the last clearing.
    '''
    ti = context['task_instance']
    actual_try_number = self._get_actual_try_number(context)
    # When the task instance is cleared, it will set the max_retry
    # number to be the current try_number + retry value.
    # From https://stackoverflow.com/a/51757521
    relative_first_try = ti.max_tries - ti.task.retries + 1
    return actual_try_number - relative_first_try + 1

How can I retrieve the 'scheduled time' for catchup jobs in Airflow?

When building an Airflow dag, I typically specify a simple schedule to run periodically - I expect this is the most common use.
dag = DAG('my_dag',
          description='this is what it does',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 10, 1),
          catchup=False)
I then need to use the 'date' as a parameter in my actual process, so I just check the current date.
date = datetime.date.today()
# do some date-sensitive stuff
operator = MyOperator(..., params=[date, ...])
My understanding is that setting catchup=True will have Airflow schedule my dag for every schedule interval between start_date and now (or end_date); e.g. every day.
How do I get the scheduled_date for use within my dag instance?
I think you mean the execution date here. You can use macros inside your operators; more detail can be found here: https://airflow.apache.org/code.html#macros. Airflow renders them at run time for each scheduled run, so you don't need to generate the date dynamically yourself.
Inside an operator, you can reference {{ ds }} directly in a templated string.
In a PythonOperator callable, you will need provide_context=True first, then accept **kwargs as the last argument of your function, and then you can read kwargs['ds'].
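A minimal sketch of both options, assuming an existing dag object (task_ids are illustrative):
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Option 1: reference the scheduled date directly in a templated field
templated = BashOperator(
    task_id='print_scheduled_date',
    bash_command='echo "scheduled for {{ ds }}"',
    dag=dag)

# Option 2: receive it in a Python callable via the context
def use_scheduled_date(**kwargs):
    scheduled_date = kwargs['ds']  # 'YYYY-MM-DD' string for this dag run
    # do some date-sensitive stuff with scheduled_date
    return scheduled_date

from_context = PythonOperator(
    task_id='use_scheduled_date',
    python_callable=use_scheduled_date,
    provide_context=True,
    dag=dag)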

Status of Airflow task within the dag

I need the status of a task, e.g. whether it is running, up_for_retry, or failed, within the same dag. I tried to get it using the code below, but I got no output...
Auto = PythonOperator(
    task_id='test_sleep',
    python_callable=execute_on_emr,
    op_kwargs={'cmd': 'python /home/hadoop/test/testsleep.py'},
    dag=dag)

logger.info(Auto)
The intention is to kill certain running tasks once a particular task in Airflow completes.
The question is: how do I get the state of a task, i.e. whether it is running, failed, or successful?
I am doing something similar. I need to check, for one task, whether the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the dag, it is not necessary to specify the dag.
I simply created a function to loop through the past n_days and check the status.
def check_status(**kwargs):
    last_n_days = 10
    for n in range(0, last_n_days):
        date = kwargs['execution_date'] - timedelta(n)
        # my_task is the task you defined within the DAG rather than the task_id
        # (as in the example below: check_success_task rather than 'check_success_days_before')
        ti = TaskInstance(*my_task*, date)
        state = ti.current_state()
        if state != 'success':
            raise ValueError('Not all previous tasks successfully completed.')
When you call the function make sure to set provide_context.
check_success_task = PythonOperator(
    task_id='check_success_days_before',
    python_callable=check_status,
    provide_context=True,
    dag=dag
)
UPDATE:
When you want to reference a task from another dag, you need to do it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is now also an API call that does the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
    dag = get_dag(args)
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    print(ti.current_state())
Hence, it seems you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively, you could execute these CLI operations from within your code using Python's subprocess library.
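A hedged sketch of that subprocess route (the dag_id, task_id, and date are illustrative; this assumes the state is printed on the last line of the CLI output):
import subprocess

output = subprocess.check_output(
    ['airflow', 'task_state', 'my_dag', 'my_task', '2019-01-01'])
state = output.decode().strip().splitlines()[-1]  # last line typically holds the state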
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm in postgres, but the structure should be the same. Start by grabbing the task_ids and state of the task you're interested in with a db call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
You can use the command line Interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer to the official Airflow documentation:
http://airflow.incubator.apache.org/cli.html
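For example (dag_id, task_id, and execution_date below are illustrative):
airflow task_state my_dag my_task 2019-01-01T00:00:00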

execution_date in airflow: need to access as a variable

I am really a newbie on this forum. But I have been playing with Airflow for some time for our company. Sorry if this question sounds really dumb.
I am writing a pipeline using a bunch of BashOperators.
Basically, for each task, I want to simply call a REST API using 'curl'.
This is what my pipeline looks like(very simplified version):
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from dateutil import tz
import datetime

datetime_obj = datetime.datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime.combine(datetime_obj.today() - datetime.timedelta(1), datetime_obj.min.time()),
    'email': ['xxxx@xxx.xxx'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=5),
}

current_datetime = datetime_obj.now(tz=tz.tzlocal())

dag = DAG(
    'test_run', default_args=default_args, schedule_interval=datetime.timedelta(minutes=60))

# the datetime is converted to str so it can be concatenated into the command
curl_cmd = 'curl -XPOST "' + hostname + ':8000/run?st=' + str(current_datetime) + '"'

t1 = BashOperator(
    task_id='rest-api-1',
    bash_command=curl_cmd,
    dag=dag)
If you notice, I am doing current_datetime = datetime_obj.now(tz=tz.tzlocal()).
Instead, what I want here is 'execution_date'.
How do I use 'execution_date' directly and assign it to a variable in my Python file?
I am having this general issue of accessing args.
Any help will be genuinely appreciated.
Thanks
The BashOperator's bash_command argument is a template. You can access execution_date in any template as a datetime object using the execution_date variable. In the template, you can use any jinja2 methods to manipulate it.
Using the following as your BashOperator bash_command string:
# pass in the first of the current month
some_command.sh {{ execution_date.replace(day=1) }}
# last day of previous month
some_command.sh {{ execution_date.replace(day=1) - macros.timedelta(days=1) }}
If you just want the string equivalent of the execution date, ds will return a datestamp (YYYY-MM-DD), ds_nodash returns same without dashes (YYYYMMDD), etc. More on macros is available in the Api Docs.
Your final operator would look like:
command = """curl -XPOST '%(hostname)s:8000/run?st={{ ds }}'""" % locals()
t1 = BashOperator( task_id='rest-api-1', bash_command=command, dag=dag)
The PythonOperator constructor takes a 'provide_context' parameter (see https://pythonhosted.org/airflow/code.html). If it's True, then it passes a number of parameters into the python_callable via kwargs. kwargs['execution_date'] is what you want, I believe.
Something like this:
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def python_method(ds, **kwargs):
    Variable.set('execution_date', kwargs['execution_date'])
    return

doit = PythonOperator(
    task_id='doit',
    provide_context=True,
    python_callable=python_method,
    dag=dag)
I'm not sure how to do it with the BashOperator, but you might start with this issue: https://github.com/airbnb/airflow/issues/775
I think you can't assign variables with values from the Airflow context outside of a task instance; they are only available at run time. Basically, there are 2 different steps when a dag is loaded and executed in Airflow:
First, your dag file is interpreted and parsed. It has to work and compile, and the task definitions must be correct (no syntax error or anything). During this step, if you make function calls to fill some values, these functions won't be able to access the Airflow context (the execution date, for example, even more so if you're doing some backfilling).
The second step is the execution of the dag. It's only during this second step that the variables provided by Airflow (execution_date, ds, etc.) are available, as they are related to an execution of the dag.
So you can't initialize global variables using the Airflow context; however, Airflow gives you multiple mechanisms to achieve the same effect:
Using a jinja template in your command (it can be in a string in the code or in a file, both will be processed). You have the list of available templates here: https://airflow.apache.org/macros.html#default-variables. Note that some functions are also available, particularly for computing day deltas and date formatting.
Using a PythonOperator to which you pass the context (with the provide_context argument). This will allow you to access the same templated values with the syntax kwargs['<variable_name>']. If you need to, you can return a value from a PythonOperator; it will be stored in an XCom variable you can use later in any template. Access to XCom variables uses this syntax: https://airflow.apache.org/concepts.html#xcoms
If you write your own operator, you can access the Airflow variables via the dict context.
def execute(self, context):
    execution_date = context.get("execution_date")

This should be inside the execute() method of your operator.
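A minimal sketch of such a custom operator, assuming a classic Airflow 1.x BaseOperator subclass (the class name is illustrative):
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class PrintExecutionDateOperator(BaseOperator):

    @apply_defaults
    def __init__(self, *args, **kwargs):
        super(PrintExecutionDateOperator, self).__init__(*args, **kwargs)

    def execute(self, context):
        # the execution date for this run is available in the context dict
        execution_date = context.get("execution_date")
        self.log.info("Running for execution_date %s", execution_date)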
To print the execution date inside the callable function of your PythonOperator, you can use the following in your Airflow script; you can also derive start_time and end_time as follows:
def python_func(**kwargs):
    execution_date = kwargs["execution_date"]  # <datetime> type with timezone
    end_time = str(execution_date)
    start_time = str(execution_date.add(minutes=-30))
I have converted the datetime value to a string as I need to pass it in a SQL query. It can be used in other ways as well.
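For completeness, a minimal sketch of wiring that callable into a PythonOperator, assuming an existing dag object (the task_id is illustrative):
from airflow.operators.python_operator import PythonOperator

print_window = PythonOperator(
    task_id='print_window',
    python_callable=python_func,
    provide_context=True,
    dag=dag
)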
You may consider SimpleHttpOperator https://airflow.apache.org/_api/airflow/operators/http_operator/index.html#airflow.operators.http_operator.SimpleHttpOperator. It makes issuing HTTP requests simple, and you can pass execution_date into the endpoint parameter via a template.
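A minimal sketch of that idea (the http_conn_id and endpoint below are illustrative; endpoint is a templated field of SimpleHttpOperator, and an existing dag object is assumed):
from airflow.operators.http_operator import SimpleHttpOperator

run_job = SimpleHttpOperator(
    task_id='rest-api-1',
    http_conn_id='my_http_conn',   # hypothetical connection pointing at the service
    endpoint='run?st={{ ds }}',    # execution date rendered via the template
    method='POST',
    dag=dag
)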
Here's another way without the context. Using the dag's last execution time can be very helpful in scheduled ETL jobs, such as a dag that 'downloads all newly added files'. Instead of hardcoding a datetime.datetime, use the dag's last execution date as your time filter.
Airflow has a class called DagRun that can be accessed like so: dag_runs = DagRun.find(dag_id=dag_id)
Here's an easy way to get the most recent run's execution time:
from airflow.models import DagRun

def get_most_recent_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    # dag_runs[0] is typically the run currently executing, so [1] is the previous one
    return dag_runs[1] if len(dag_runs) > 1 else None
Then, within your PythonOperator, you can dynamically access the dag's last execution by calling the function you created above:
last_execution = get_most_recent_dag_run('dag')
Now it's a variable!
