My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc.), and then a task is launched for every element in that input list. The goal is to make use of Airflow's retry handling and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this:
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
import json
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1)
}

dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)

foo_operator = BashOperator(
    task_id='foo',
    bash_command="echo '%s'" % json.dumps(list(range(0, random.randint(40, 60)))),
    xcom_push=True,
    dag=dag)

def gen_nodes(**kwargs):
    ti = kwargs['ti']
    workers = json.loads(ti.xcom_pull(task_ids='foo'))

    for wid in workers:
        print("Iterating worker %s" % wid)
        op = PythonOperator(
            task_id='test_op_%s' % wid,
            python_callable=lambda: print("Dynamic task!"),
            dag=dag
        )

        op.set_downstream(bar_operator)
        op.set_upstream(dummy_op)

gen_subdag_node_op = PythonOperator(
    task_id='gen_subdag_nodes',
    python_callable=gen_nodes,
    provide_context=True,
    dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)

dummy_op = DummyOperator(
    task_id='dummy',
    dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)

bar_operator = DummyOperator(
    task_id='bar',
    dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, airflow does not support adding/removing a task while the dag is running.
The workflow order will be whatever is evaluated at the start of the dag run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
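For illustration, here is a minimal sketch of that kind of parse-time fan-out (the worker count comes from an Airflow Variable named num_workers, which is just an example; the point is that it is resolved when the scheduler parses the file, not during the run):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('static_fanout', schedule_interval=None, start_date=datetime(2015, 6, 1))

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

# Resolved when the scheduler parses the file, before any dag run exists.
num_workers = int(Variable.get('num_workers', default_var=5))

for wid in range(num_workers):
    worker = BashOperator(
        task_id='worker_%d' % wid,
        bash_command='echo "processing worker %d"' % wid,
        dag=dag,
    )
    start >> worker >> end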
Many times you can instead use a BranchPythonOperator to make a decision during a dag run (and these decisions can be based on your xcom values), but it must be a decision to go down a branch that already exists in the workflow.
Dag runs, and Dag definitions are separated in airflow in ways that aren't entirely intuitive, but more or less anything that is created/generated inside a dag run (xcom, dag_run.conf, etc.) is not usable for defining the dag itself.
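For reference, a minimal sketch of that BranchPythonOperator pattern, assuming Airflow 1.x import paths (task and dag names are only illustrative); the decision is made during the run from an xcom value, but both branches already exist in the dag:

import random
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('branch_example', schedule_interval=None, start_date=datetime(2015, 6, 1))

# 'foo' pushes a value during the run (its return value becomes an xcom).
foo = PythonOperator(
    task_id='foo',
    python_callable=lambda: random.randint(40, 60),
    dag=dag,
)

def choose_path(**kwargs):
    count = kwargs['ti'].xcom_pull(task_ids='foo') or 0
    # Both branches already exist in the dag; we only pick one of them.
    return 'many_workers' if int(count) > 50 else 'few_workers'

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_path,
    provide_context=True,
    dag=dag,
)

many = DummyOperator(task_id='many_workers', dag=dag)
few = DummyOperator(task_id='few_workers', dag=dag)
# 'one_success' lets the join run even though the other branch was skipped.
join = DummyOperator(task_id='join', trigger_rule='one_success', dag=dag)

foo >> branch >> [many, few] >> join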
I'm trying to make a DAG in which two (or more) tasks run at the same time, while a downstream task should wait for them to finish before running.
Something similar to this:
I was trying to run the following code:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'test',
    default_args={"start_date": datetime(2019, 12, 5)},
    schedule_interval=None
)

start = DummyOperator(task_id='start', dag=dag)
end_opr = DummyOperator(task_id='end_opr', dag=dag)
dummy1 = DummyOperator(task_id='dummy1', dag=dag)
dummy2 = DummyOperator(task_id='dummy2', dag=dag)

start >> [dummy1, dummy2] >> end_opr
But what I get is a duplicate of end_opr, instead of dummy1 and dummy2 joining it at the end.
What am I doing wrong?
My env: composer-1.17.2-airflow-1.10.15
What you are doing is correct! However you are comparing the Graph view (first image) vs the Tree view (second image). The Tree view shows a DAG for each distinct root-to-leaf path. The end_opr task is not truly duplicated but rather appears twice because it is part of 2 distinct paths. Check out the Graph view in the UI for this DAG; you should see what you are aiming for there.
I want to create a DAG to run in Google Cloud Composer. The workflow contains a ParallelFor and I don't know how to model that.
The workflow looks something like this:
task1 >> task2 >> task3 >> task4
where task2 splits data into x arrays. Now, I want to run task3 in parallel for these x arrays. Task3 outputs something and task4 combines the outputs.
(you can find a picture of the workflow here: https://github.com/Apollo-Workflows/Sentiment-Analysis)
For now, I have two possible ideas for how it could work:
1) There is an easy syntax for it (like >> for sequential execution), but I did not find such syntax.
2) Working with sub-DAGs. My idea was to extend task2 so that it creates x subDAGs (one for each array). The subDAG is basically task3. After all subDAGs are finished, their output is forwarded to task4. Is that possible? If yes, how do I do it?
I have found a solution for my problem. It follows my first possible solution idea. Just use the mechanics from this link:
Airflow rerun a single task multiple times on success
I believe that the post you mentioned as a possible idea points in the direction of running a task after the previous one has ended.
To run tasks in parallel you should follow a structure similar to this:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("dag_paralel", description="Starting tutorial", schedule_interval=None,
          start_date=datetime(2019, 1, 1),
          catchup=False)

task_1 = BashOperator(task_id='task_1', bash_command='echo "This is task 1!"', dag=dag)
task_2 = BashOperator(task_id='task_2', bash_command='echo "This is task 2!"', dag=dag)

task_list = []
max_attempt = 3

for attempt in range(max_attempt):
    data_pull = BashOperator(
        task_id='task_3_{}'.format(attempt),
        bash_command='echo "This is task - 3_{}!"'.format(attempt),
        dag=dag
    )
    task_list.append(data_pull)

data_validation = BashOperator(task_id='task_final', bash_command='echo "We are at the end"', dag=dag)

task_1 >> task_2 >> task_list
task_list >> data_validation
This is the DAG structure obtained by this method
Is there a way to persist an XCOM value during re-runs of a DAG step (after clearing the status)?
Below is a simplified version of what I'm trying to accomplish: when a DAG step's status is cleared and the step is re-run, I would like to be able to load the XCom value pushed on the previous run. However, even though I can see the value in the XCom interface, the value does not get pulled. I've looked through the source code for the xcom_pull() method but can't figure out where it is being filtered out.
The functionality I'm trying to achieve is to maintain some amount of state between failed runs of a DAG. In the example, this would mean that 1 is added to the stored value every time the DAG step is cleared and rerun.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def test_step(**kwargs):
    ti = kwargs.get('task_instance')
    value = ti.xcom_pull(key='key', include_prior_dates=True)
    if value is None:
        value = 0
    print(f'BEFORE VALUE: {value}')
    value += 1
    print(f'AFTER VALUE: {value}')
    ti.xcom_push(key='key', value=value)
    # Simulating a failure
    raise Exception

default_args = {
    'owner': 'Testing',
    'depends_on_past': False,
    'email': ['test@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

dag = DAG(
    'test_dag',
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime(2020, 4, 9),
)

t1 = PythonOperator(
    task_id='test_step',
    provide_context=True,
    python_callable=test_step,
    dag=dag,
)

t1
Anytime a task is about to run, its XCom is cleared for the current execution date (https://github.com/apache/airflow/blob/1.10.10/airflow/models/taskinstance.py#L960). This is why you won't ever pull values from previous task tries. Use of include_prior_dates=True only pulls from previous execution dates, but not previous runs of the same execution date.
One possible solution is to put a DummyOperator task upstream of your test_step task, called say xcom_store.test_step. Then use airflow.models.XCom.set() directly in test_step to write your XCom values onto the xcom_store.test_step task (reference xcom_push() as an example). When you need to pull, just pull as you usually would, but from the dummy task instead, i.e. ti.xcom_pull(task_ids='xcom_store.test_step', key='key'). Definitely not ideal and could lead to some confusion, but if you standardize it and build some helpers around it, it could be alright?
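A minimal sketch of that workaround, assuming Airflow 1.10.x (where XCom.set() accepts key, value, execution_date, task_id and dag_id); the xcom_store.test_step DummyOperator is assumed to be defined upstream as described:

from airflow.models import XCom

def test_step(**kwargs):
    ti = kwargs['task_instance']

    # Pull from the dummy "storage" task instead of our own task_id,
    # so the value survives the automatic XCom clearing of test_step.
    value = ti.xcom_pull(task_ids='xcom_store.test_step', key='key') or 0
    value += 1

    # Write the new value back onto the dummy task's XCom row.
    XCom.set(
        key='key',
        value=value,
        task_id='xcom_store.test_step',
        dag_id=ti.dag_id,
        execution_date=kwargs['execution_date'],
    )

    raise Exception('Simulating a failure')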
I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can dynamically generate tasks in a loop and pass the offset to your operator.
Here is an example with the PythonOperator.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(dag_id='daily_with_offsets', default_args=args, schedule_interval='0 10 * * *')

def check_trigger(execution_date, day_offset, **kwargs):
    # Work on the date `day_offset` days before the current execution date.
    target_date = execution_date - timedelta(days=day_offset)
    # use target_date

for day_offset in range(1, 8):
    PythonOperator(
        task_id='task_offset_' + str(day_offset),
        python_callable=check_trigger,
        provide_context=True,
        dag=dag,
        op_kwargs={'day_offset': day_offset}
    )
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.
I have a subdag as one of the nodes of a main DAG. The workflow works fine.
I tried to increase the levels of hierarchy by including another subdag inside the subdag, but Airflow seems to get confused. A couple of questions in this regard:
1) Does airflow support subdag inside a subdag? If so, is there a limit to the hierarchy?
2) Are there any best practices in using a subdag inside a subdag?
I have recently started using Airflow myself, and Airflow does indeed support having subdags inside subdags. I was able to go up to 4 levels deep, but I'm not sure as to the exact limit of the hierarchy. Hope that helps!
Short answer: yes, you can.
I managed to achieve this by using subdag create functions following https://github.com/geosolutions-it/evo-odas/wiki/Airflow---about-subDAGs,-branching-and-xcom
I achieved 3 levels of subdag hierarchy.
The trick is to be extremely careful to follow the maindag.subdag notation in your tasks.
task = SubDagOperator(
    subdag=create_subdag(dag_name, task_id, datetime(2019, 1, 29), None),
    task_id=task_id,
    dag=dag
)
And in the create_subdag function you need to work carefully with the parent and child dag names; nesting several such functions does the job.
You'll see the errors in the UI, if any.
I will update this post with more code if needed, but it really depends on how many levels you need, so design the functions accordingly.
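For reference, a minimal sketch of what such a create_subdag factory could look like (the function and argument names are just examples; the part that matters is that the subdag's dag_id is '<parent dag_id>.<task_id of the SubDagOperator>'):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def create_subdag(parent_dag_id, child_task_id, start_date, schedule_interval):
    """Return a DAG whose dag_id is '<parent_dag_id>.<child_task_id>'."""
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, child_task_id),
        start_date=start_date,
        schedule_interval=schedule_interval,
    )
    DummyOperator(task_id='work', dag=subdag)
    return subdag

dag = DAG('maindag', start_date=datetime(2019, 1, 29), schedule_interval=None)

# Level 1: the subdag's dag_id must be 'maindag.level1', matching the task_id below.
level1 = SubDagOperator(
    task_id='level1',
    subdag=create_subdag('maindag', 'level1', datetime(2019, 1, 29), None),
    dag=dag,
)
# For deeper levels, call create_subdag again inside the factory,
# passing 'maindag.level1' as the parent dag_id, and so on.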
I recently started experimenting with subdags. I don't think there is a limit on how deep they can go. A lot of webpages suggest staying away from subdags because of the issue with connection pools.
I made an example here with 2 levels. It can be refactored further but it demonstrates the point.
1) Create a helper method to create dags with subdag names formatted as parent.child.
def create_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    '''Returns a DAG which has the dag_id formatted as parent.child'''
    return DAG(
        dag_id='{}.{}'.format(parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
        max_active_runs=15
    )
2) Then recursively create and assign subdags to parent dags. Remember, subdags are still DAGs. SubDagOperator only bundles it as a task for the parent dag.
level1_list = ['AWS', 'AZURE']
level2_list = ['eu', 'us', 'ap', 'jp']
tasks = ['task_{}'.format(str(i)) for i in range(0, 10)]

for level1_item in level1_list:
    level1_dag = create_sub_dag(dag_id, level1_item, datetime(2020, 3, 10), '0 6 * * *')
    level1_subdag_operator = SubDagOperator(
        subdag=level1_dag,
        task_id=level1_item,
        dag=dag,
    )
    level1_dag_id = '{}.{}'.format(dag_id, level1_item)

    for level2_item in level2_list:
        level2_dag = create_sub_dag(level1_dag_id, level2_item, datetime(2020, 3, 10), '0 6 * * *')
        level2_subdag_operator = SubDagOperator(
            subdag=level2_dag,
            task_id=level2_item,
            dag=level1_dag,
        )
        level2_dag_id = '{}.{}.{}'.format(dag_id, level1_item, level2_item)
        create_tasks(level2_dag, tasks)
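The excerpt assumes a few surrounding pieces (the imports, the root dag, and the create_tasks helper); a minimal sketch of what those might look like (the dag_id and the helper body are just examples):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

# Root dag; its dag_id is the prefix used by every nested subdag name.
dag_id = 'multi_level_example'
dag = DAG(dag_id, start_date=datetime(2020, 3, 10), schedule_interval='0 6 * * *')

def create_tasks(target_dag, task_ids):
    """Attach a simple placeholder task to the given (sub)dag for each id."""
    for task_id in task_ids:
        DummyOperator(task_id=task_id, dag=target_dag)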