What is task_instance.xcom_pull in Airflow?

I am trying to run EMR through Airflow and found an example that says:
step_adder = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=SPARK_STEPS,
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
)
What does job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}" mean?
What does this tell me?
Thanks,
Xi

In Airflow, tasks cannot share data directly, but they can share metadata. This is done by one task writing a record to the XCom table in the metadata database while another task reads it.
task_instance.xcom_pull('create_job_flow', key='return_value') means:
Go to the XCom table,
find the row matching this DagRun and task_id='create_job_flow',
and return the entry saved under key='return_value'.
The {{ }} is Jinja templating syntax that means "print" the value. This is needed because the value you are looking for only exists at run time: the create_job_flow task must run and save the value to the database before the add_steps task can read it.
In practice this means that the create_job_flow task creates an EMR cluster and saves the cluster/machine id to the XCom table. The next task, add_steps, submits steps to that machine - for that you need the machine id, so you must read (pull) the value from the XCom table. The value will be different for each DagRun, since each DagRun creates a new machine.
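For illustration only, here is a minimal sketch (the show_job_flow_id task name is made up) of doing the same pull in Python code instead of a Jinja template:

from airflow.operators.python_operator import PythonOperator

def print_job_flow_id(**context):
    # Reads the value that create_job_flow returned and Airflow stored in the XCom table
    job_flow_id = context['task_instance'].xcom_pull(
        task_ids='create_job_flow', key='return_value')
    print(job_flow_id)

show_job_flow_id = PythonOperator(
    task_id='show_job_flow_id',
    python_callable=print_job_flow_id,
    provide_context=True,  # needed on Airflow 1.x so the context dict is passed in
    dag=dag,
)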

Related

How to execute the same query for several dynamically decided dates in a single DAG run?

The main idea is to run BigQueryOperator for several specific, dynamically determined dates (so the date must be passed to an external query) and have the results written to the corresponding partitions (using the $ suffix of the destination table), all in a single DAG run. The dates depend on the execution_date.
I understand that user_defined_macros only exist on (sub)DAG level, so I will have to spawn subDAGs dynamically. But execution_date is only available within operators, such as PythonOperator, and it does not seem possible to spawn a subDAG (or any operator for that matter) from inside another operator.
So either I need a way to access execution_date not from an operator but from within a DAG itself, or an alternative way of passing my custom date to the external query without user_defined_macros on (sub)DAG level.
Is there a standard way (or any way) of dealing with similar situations?
OK, apparently it is possible to pass Jinja-readable parameters to the BigQueryOperator to be used both in the query and in the partition suffix, like so:
def bq_to_bq_fn(day_offset=None):
    return BigQueryOperator(
        task_id='bq_to_bq' if day_offset is None else f'bq_to_bq_{day_offset}',
        use_legacy_sql=False,
        write_disposition=write_mode_dag,
        allow_large_results=True,
        params={
            'day_offset': 0 if day_offset is None else day_offset
        },
        sql='''
            SELECT * FROM TEST.test1
            WHERE date = '{{ (execution_date - macros.timedelta(days=params.day_offset)).strftime("%Y-%m-%d") }}'
        ''',
        time_partitioning={"type": "DAY", "field": "date"},
        destination_dataset_table=project_id + '.TEST.test2$' + '{{ (execution_date - macros.timedelta(days=params.day_offset)).strftime("%Y%m%d") }}',
        dag=dag
    )
The SQL part can also be in an external file.
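For instance, a sketch of that variant (the file path here is hypothetical; the .sql file would contain the same Jinja-templated query as above, and is rendered because .sql is one of the operator's template extensions):

bq_from_file = BigQueryOperator(
    task_id='bq_to_bq_from_file',
    use_legacy_sql=False,
    params={'day_offset': 0},
    sql='sql/daily_extract.sql',  # hypothetical path, resolved relative to the DAG folder and templated by Jinja
    destination_dataset_table=project_id + '.TEST.test2$' + '{{ execution_date.strftime("%Y%m%d") }}',
    time_partitioning={"type": "DAY", "field": "date"},
    dag=dag,
)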

Can I create an ad hoc parameterized DAG from a scheduler DAG running once a minute?

I'm researching Airflow to see if it is a viable fit for my use case, and it's not clear from the documentation whether it fits this scenario. I'd like to schedule a job workflow per customer based on some very dynamic criteria that doesn't fall into the standard "CRON" loop of running every X minutes etc. (since running them together has some impact).
Customer DB
Customer_id, "CRON" desired interval (best case)
1 , 30 minutes
2 , 2 hours
...
... <<<<<<< thousands of these potential jobs
Every minute I'd like to query the state of the current work in the system as well as real-world "sensor" data which changes often (such as load on some DBs, quotas to other resources, ad hoc priorities, boosting etc.)
When decided, I'd need to create a "DAG" (pipeline) of work per customer which had been deemed worthy of running at this time (since perhaps we want to delay work for the "CRON" given some complicated analysis).
For instance :
Every minute run this test:

for customer in DB:
    if shouldRunDAGForCustomer(customer):
        Create a DAG with states ..... and run it
"Sleep for a minute"

def shouldRunDagForCustomer(...):
    currentStateOfJobs = ....
    situationalAwarenessOfWorld = ....  # check all sorts of interesting metrics
    if some criteria is met for this customer: return true  # run the dag for this customer
    else: return false
From the material I've read, it seems that DAGs are given a specified schedule and are static in their structure. It also seems that DAGs run on all their inputs, rather than being generated per input.
It also wasn't clear how the scheduling works if the given DAG hadn't completed but the scheduled time had arrived. Would I potentially have multiple runs of the same pipeline for the same input (bad)? Since I have pipelines whose time to complete varies depending on the customer, dynamic load on the system etc., I'd like to manage the scheduling aspect and the generation of "DAGs" myself.
This is possible through a "controller" DAG that is scheduled every minute, which then triggers runs for a "target" DAG when desired conditions are met. Airflow has a good example of this, see example_trigger_controller_dag and example_trigger_target_dag. The controller uses the TriggerDagRunOperator() which is an operator that kicks off a run for any desired DAG.
trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="example_trigger_target_dag",  # Ensure this equals the dag_id of the DAG to trigger
    conf={"message": "Hello World"},
    dag=dag,
)
Then the target DAG doesn't need to do anything special, except that it should have schedule_interval=None. Note that on trigger, you can populate a conf dictionary that the target can later consume, in case you want to customize each triggered run.
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: \'{{ dag_run.conf["message"] if dag_run else "" }}\'"',
    dag=dag,
)
Back to your case: your scenario is similar, but where you differ from the example is that you won't kick off a DAG every time, and you have multiple target DAGs that you could kick off. This is where the ShortCircuitOperator comes into play, which is basically a task that runs a Python method you specify, which just needs to return true or false. If it returns true, then it continues on to the next downstream task as usual; otherwise it "short circuits" and skips the downstream tasks. It's worth giving example_short_circuit_operator a run if you want to see this demonstrated. With that and dynamic creation of tasks in a for-loop, I think you'll get something like this in your controller DAG:
dag = DAG(
    dag_id='controller_customer_pipeline',
    default_args=args,
    schedule_interval='* * * * *',
)

def shouldRunDagForCustomer(customer, ...):
    currentStateOfJobs = ....
    situationalAwarenessOfWorld = ....  # check all sorts of interesting metrics
    if some criteria is met for this customer: return true  # run the dag for this customer
    else: return false

for customer in DB:
    check_run_conditions = ShortCircuitOperator(
        task_id='check_run_conditions_' + customer,
        python_callable=shouldRunDagForCustomer,
        op_args=[customer],
        op_kwargs={...},  # extra params if needed
        dag=dag,
    )
    trigger_run = TriggerDagRunOperator(
        task_id='trigger_run_' + customer,
        trigger_dag_id='target_customer_pipeline_' + customer,  # standardize on DAG ids for per-customer DAGs
        conf={"foo": "bar"},  # pass on extra info to triggered DAG if needed
        dag=dag,
    )
    check_run_conditions >> trigger_run
Then your target DAG is just the per customer work.
This is probably not the only way you could implement something like this, but basically yes I think it's viable to implement in Airflow.

Reference filename via xcom in Airflow

I'm trying to understand how to pass values via Airflow's XCom functionality. The specific use case I am trying to build is: write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt

DAG = DAG(
    dag_id='xcom_test_dag',
    start_date=dt.datetime.now(),
    schedule_interval='@once'
)

def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    return context['task_instance'].xcom_push(key='filename', value=file_name)

def pull_function(**context):
    dir(context['task_instance'].xcom_pull())

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)

push_task >> pull_task
If I want to reference the file name in pull_task so I can read the file - how should I call that? Trying to access context['task_instance'] does not contain a value. Further - is it best practice to try and reference a file name like this from task to task / operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
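Putting that together, a rough sketch of what pull_function could look like once it pulls by task ID and key (the file reading is only illustrative and assumes the worker running the task can reach the file):

def pull_function(**context):
    # Fetch the filename pushed by push_task under the 'filename' key
    file_name = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
    with open(file_name) as f:
        print(f.read())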

When does an Airflow DAG definition get evaluated?

Suppose I have an airflow dag file that creates a graph like so...
def get_current_info(filename):
    current_info = {}
    # <fill in info in current_info relevant for today's date for the given file>
    return current_info

files = [
    get_current_info("file_001"),
    get_current_info("file_002"),
    ....
]

for f in files:
    # <some BashOperator bo1 using f's current info dict>
    # <some BashOperator bo2 using f's current info dict>
    ....
    bo1 >> bo2
    ....
Since the values in the current_info dict used to define the DAG change periodically (here, daily), I would like to know by what process / on what schedule the DAG definition gets updated. (I print the current_info values each run and the values appear to be updating, but I'm curious as to how and when exactly this happens.)
Is "when does an Airflow DAG definition get evaluated" referenced anywhere in the docs?
The DAGs are evaluated in every run of the scheduler.
This article describes how the scheduler works and at what stage the DAG files are picked up for evaluation.
After some discussion on the Airflow email list, it turns out that Airflow builds the DAG for each task when that task is run (so each task includes the overhead of building the DAG again, which in my case was very significant).
See more details on this here: https://stackoverflow.com/a/59995882/8236733

Airflow depends_on_past for whole DAG

Is there a way in airflow of using the depends_on_past for an entire DagRun, not just applied to a Task?
I have a daily DAG, and the Friday DagRun errored on the 4th task; however, the Saturday and Sunday DagRuns still ran as scheduled. Using depends_on_past=True would have paused the DagRun at the same 4th task, but the first 3 tasks would still have run.
I can see in the DagRun DB table there is a state column that contains failed for the Friday DagRun. What I want is a way of configuring a DagRun not to start if the previous DagRun failed, rather than starting and running until it hits a task that previously failed.
Does anyone know if this is possible?
On your first task, set depends_on_past=True and wait_for_downstream=True; the combination results in the current dag-run running only if the last run succeeded.
That is because the first task in the current dag-run waits for the previous run's instance of that task (depends_on_past) and for its downstream tasks (wait_for_downstream) to succeed.
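A minimal sketch of that setup (the DummyOperator and task name are just placeholders for whatever your real first task is):

from airflow.operators.dummy_operator import DummyOperator

first_task = DummyOperator(
    task_id='first_task',
    depends_on_past=True,       # the previous run's first_task must have succeeded
    wait_for_downstream=True,   # so must the tasks immediately downstream of it (note the caveat further down: this is not recursive)
    dag=dag,
)

# first_task >> all the other tasks of the DAG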
This question is a bit old, but it turns up as a first Google search result and the highest-rated answer is clearly misleading (it made me struggle a bit), so it definitely demands a proper answer. Although the second-rated answer should work, there's a cleaner way to do this, and I personally find using xcom ugly.
Airflow has a special operator class designed for monitoring the status of tasks from other dag runs, or of other dags as a whole. So what we need to do is add a task preceding all the tasks in our dag that checks whether the previous run has succeeded.
from airflow.sensors.external_task_sensor import ExternalTaskSensor

previous_dag_run_sensor = ExternalTaskSensor(
    task_id='previous_dag_run_sensor',
    dag=our_dag,
    external_dag_id=our_dag.dag_id,
    execution_delta=our_dag.schedule_interval
)

previous_dag_run_sensor.set_downstream(vertices_of_indegree_zero_from_our_dag)
One possible solution would be to use xcom:
Add 2 PythonOperators start_task and end_task to the DAG.
Make all other tasks depend on start_task
Make end_task depend on all other tasks (set_upstream).
end_task will always push a variable last_success = context['execution_date'] to xcom (xcom_push). (Requires provide_context = True in the PythonOperators).
And start_task will always check xcom (xcom_pull) to see whether there exists a last_success variable with a value equal to the previous DagRun's execution_date, or to the DAG's start_date (to let the process start). A rough sketch of this pattern follows the example link below.
Example use of xcom:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_xcom.py
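A rough sketch of that pattern (the success check is simplified; as noted above, a real version also has to let the very first run through by comparing against the DAG's start_date, and this assumes prev_execution_date is available in the task context):

from airflow.operators.python_operator import PythonOperator

def push_last_success(**context):
    # end_task only runs if every upstream task succeeded, so record this run as good
    context['task_instance'].xcom_push(key='last_success',
                                       value=context['execution_date'].isoformat())

def check_previous_success(**context):
    # start_task looks for the marker pushed by the previous run's end_task
    last_success = context['task_instance'].xcom_pull(
        task_ids='end_task', key='last_success', include_prior_dates=True)
    prev = context.get('prev_execution_date')
    if last_success != (prev.isoformat() if prev else None):
        raise ValueError('Previous DagRun did not finish successfully')

start_task = PythonOperator(task_id='start_task', python_callable=check_previous_success,
                            provide_context=True, dag=dag)
end_task = PythonOperator(task_id='end_task', python_callable=push_last_success,
                          provide_context=True, dag=dag)

# start_task >> [all other tasks] >> end_task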
Here is a solution that addresses Marc Lamberti's concern, namely, that wait_for_downstream is not "recursive".
The solution entails "embedding" your original DAG between two dummy tasks, a start_task and an end_task.
Such that:
The start_task precedes all your original initial tasks (i.e., no other task in your DAG can start until start_task is completed).
An end_task follows all your original ending tasks (i.e., all branches in your DAG converge on that dummy end_task).
start_task also directly precedes end_task.
These conditions are provided by the following code:
start_task >> [_all_your_initial_tasks_here_]
[_all_your_ending_tasks_here] >> end_task
start_task >> end_task
Additionally, one needs to set depends_on_past=True and wait_for_downstream=True on start_task.
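A minimal sketch of that layout (the bracketed task lists are placeholders for your real tasks, as in the snippet above):

from airflow.operators.dummy_operator import DummyOperator

start_task = DummyOperator(
    task_id='start_task',
    depends_on_past=True,      # the previous run's start_task must have succeeded
    wait_for_downstream=True,  # and so must its immediate downstream tasks, which include end_task,
                               # i.e. effectively the whole previous run
    dag=dag,
)
end_task = DummyOperator(task_id='end_task', dag=dag)

start_task >> [_all_your_initial_tasks_here_]
[_all_your_ending_tasks_here_] >> end_task
start_task >> end_task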
