I'm trying to make a DAG in which two (or more) tasks run at the same time, while a downstream task waits for them to finish before running.
Something similar to this:
This is the code I was trying to run:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'test',
    default_args={"start_date": datetime(2019, 12, 5)},
    schedule_interval=None
)
start = DummyOperator(task_id='start', dag=dag)
end_opr = DummyOperator(task_id='end_opr', dag=dag)
dummy1 = DummyOperator(task_id='dummy', dag=dag)
dummy2 = DummyOperator(task_id='dummy2', dag=dag)
start >> [dummy1, dummy2] >> end_opr
But what I get is end_opr duplicated at the end of each branch, instead of dummy1 and dummy2 joining into a single end_opr.
What am I doing wrong?
My env: composer-1.17.2-airflow-1.10.15
What you are doing is correct! However you are comparing the Graph view (first image) vs the Tree view (second image). The Tree view shows a DAG for each distinct root-to-leaf path. The end_opr task is not truly duplicated but rather appears twice because it is part of 2 distinct paths. Check out the Graph view in the UI for this DAG; you should see what you are aiming for there.
Suppose I have Airflow tasks like this:
apple_task = DummyOperator(
task_id='apple'
)
banana_task = DummyOperator(
task_id='banana'
)
cherry_task = DummyOperator(
task_id='cherry'
)
apple_task >> cherry_task
banana_task >> cherry_task
Do the repeated applications of >> stack or replace the previous one?
What will the graph look like?
Airflow 2.2.2
They stack: apple_task and banana_task will run in parallel, and both must succeed for cherry_task to run.
It's equivalent to [apple_task, banana_task] >> cherry_task.
The scheduler parses the DAG files (every 30 seconds by default); the DAG file is read and the graph is constructed. An advantage of specifying task dependencies as you did is that you can create tasks dynamically at parse time, since they're just Python objects.
The DAG documentation page has some more examples under the task dependencies heading here and the control flow heading here.
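For illustration, here is a minimal sketch (the DAG and task names are made up) of creating tasks in a loop at parse time and fanning them into a single downstream task:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG("fruit_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # tasks created dynamically at parse time from a plain Python list
    fruit_tasks = [DummyOperator(task_id="eat_%s" % name) for name in ("apple", "banana", "cherry")]
    smoothie = DummyOperator(task_id="smoothie")

    # each >> call stacks another upstream dependency onto smoothie,
    # so all three fruit tasks must succeed before it runs
    for t in fruit_tasks:
        t >> smoothie

Since the loop runs when the scheduler parses the file, the number of tasks can depend on anything available at parse time, but not on values produced during a run.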
I want to create a DAG to run in Google Cloud Composer. The workflow contains a ParallelFor and I don't know how to model that.
The workflow looks something like this:
task1 >> task2 >> task3 >> task4
where task2 splits data into x arrays. Now, I want to run task3 in parallel for these x arrays. Task3 outputs something and task4 combines the outputs.
(you can find a picture of the workflow here: https://github.com/Apollo-Workflows/Sentiment-Analysis)
For now, I have two ideas for how it could work:
There is an easy syntax for it (like >> for sequential execution), but I have not found such syntax.
Working with sub-DAGs. My idea was to extend task2 so that it creates x sub-DAGs (one for each array). The sub-DAG is basically task3. After all sub-DAGs are finished, their output is forwarded to task4. Is that possible? If yes, how do I do it?
I have found a solution for my problem. It follows my first idea. Just use the mechanics from this link:
Airflow rerun a single task multiple times on success
I believe the post you mentioned points in the direction of how to run a task after the previous one has ended.
To run tasks in parallel you should follow a structure similar to this:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
dag = DAG("dag_paralel", description="Starting tutorial", schedule_interval=None,
start_date=datetime(2019, 1, 1),
catchup=False)
task_1 = BashOperator(task_id='task_1', bash_command='echo "This is task 1!"',dag=dag)
task_2 = BashOperator(task_id='task_2', bash_command='echo "This is task 2!"',dag=dag)
task_list = []
max_attempt = 3
for attempt in range(max_attempt):
    data_pull = BashOperator(
        task_id='task_3_{}'.format(attempt),
        bash_command='echo "This is task - 3_{}!"'.format(attempt),
        dag=dag
    )
    task_list.append(data_pull)
data_validation = BashOperator(task_id='task_final', bash_command='echo "We are at the end"',dag=dag)
task_1 >> task_2 >> task_list
task_list >> data_validation
This is the DAG structure obtained by this method
Airflow graph task branch never runs and complains “Task instance did not exist in the DB”, but the task is visible in the graph.
I have an airflow graph with a conditional branch defined like this:
class BranchFlags(Enum):
yes = "yes"
no = "no"
...
for table in list_of_tables:  # list_of_tables is a list of dicts
    task_1 = BashOperator(
        task_id='task_1_%s' % table["conf1"],
        bash_command='bash script1.sh %s' % table["conf1"],
        dag=dag)

    if table["branch_flag"] == BranchFlags.yes:
        consolidate = BashOperator(
            task_id='task_3_%s' % table["conf2"],
            bash_command='python %s/consolidate_parquet.py %s' % table["conf2"],
            dag=dag)

    task_3 = BashOperator(
        task_id='task_3_%s' % table["conf3"],
        bash_command='bash script3.sh %s' % table["conf3"],
        dag=dag)

    task_1 >> task_3
    if table["branch_flag"] == BranchFlags.yes:
        task_1 >> consolidate
and here is the graph in the airflow UI from my actual code:
Notice that even though the longer parts of the graph run fine, the lone branch is not being run for the one sequence that was supposed to branch. When viewing the logs for the task, I see
*** Task instance did not exist in the DB
This is weird to me, since the scheduler database ostensibly knows about the task: it does appear in the web UI graph. I'm not sure what is going on here; other changes to the DAG .py file do show up in the graph and are executed by the scheduler when running the graph. Attempting to view the task's Task Instance Details throws the error
Task [dagname.task_3_qwerty] doesn't seem to exist at the moment
Running airflow resetdb (as I've seen in other posts) does nothing for the problem.
Note that the intention is that the short branch runs concurrently with the longer branch (not as an either or choice).
Anyone know why this would be happening or have some debugging tips?
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc.), and to launch a task for every element in that list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this:
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
import json
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1)
}

dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)

foo_operator = BashOperator(
    task_id='foo',
    bash_command="echo '%s'" % json.dumps(list(range(0, random.randint(40, 60)))),
    xcom_push=True,
    dag=dag)
def gen_nodes(**kwargs):
    ti = kwargs['ti']
    workers = json.loads(ti.xcom_pull(task_ids='foo'))
    for wid in workers:
        print("Iterating worker %s" % wid)
        op = PythonOperator(
            task_id='test_op_%s' % wid,
            python_callable=lambda: print("Dynamic task!"),
            dag=dag
        )
        op.set_downstream(bar_operator)
        op.set_upstream(dummy_op)
gen_subdag_node_op = PythonOperator(
task_id='gen_subdag_nodes',
python_callable=gen_nodes,
provide_context=True,
dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)
dummy_op = DummyOperator(
task_id='dummy',
dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)
bar_operator = DummyOperator(
task_id='bar',
dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, Airflow does not support adding/removing a task while the DAG is running.
The workflow order will be whatever was evaluated at the start of the DAG run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
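As a minimal sketch (the Variable name and task names here are made up), tasks can be generated at parse time from something that lives outside any particular run, such as an Airflow Variable:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('parse_time_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

# evaluated every time the scheduler parses the file, not during a run
worker_count = int(Variable.get("worker_count", default_var=3))

workers = [DummyOperator(task_id='worker_%d' % i, dag=dag) for i in range(worker_count)]
start >> workers >> end

Changing the Variable changes the shape of the DAG the next time the file is parsed, but nothing a task does during a run can add tasks to that same run.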
Many times you can instead use a BranchPythonOperator to make a decision during a DAG run (and these decisions can be based on your XCom values), but it must be a decision to go down a branch that already exists in the workflow.
DAG runs and DAG definitions are separated in Airflow in ways that aren't entirely intuitive, but more or less anything created/generated inside a DAG run (XCom, dag_run.conf, etc.) is not usable for defining the DAG itself.
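For reference, a minimal sketch of the BranchPythonOperator approach (the branch names and the threshold are made up), reusing the dag and foo_operator objects from the question:

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

def choose_branch(**context):
    # decide based on the XCom pushed by 'foo'; both target tasks must already exist in the DAG
    workers = json.loads(context['ti'].xcom_pull(task_ids='foo'))
    return 'many_workers_path' if len(workers) > 50 else 'few_workers_path'

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag)

many = DummyOperator(task_id='many_workers_path', dag=dag)
few = DummyOperator(task_id='few_workers_path', dag=dag)

foo_operator >> branch >> [many, few]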
I need the status of a task (e.g. whether it is running, up_for_retry, or failed) from within the same DAG. I tried to get it using the code below, but it gave me no useful output...
Auto = PythonOperator(
task_id='test_sleep',
python_callable=execute_on_emr,
op_kwargs={'cmd':'python /home/hadoop/test/testsleep.py'},
dag=dag)
logger.info(Auto)
The intention is to kill certain running tasks once a particular task on airflow completes.
The question is: how do I get the state of a task, e.g. whether it is running, failed, or successful?
I am doing something similar. I need to check, for one task, whether the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the DAG, it is not necessary to specify the dag.
I simply created a function to loop through the past n_days and check the status.
def check_status(**kwargs):
    last_n_days = 10
    for n in range(0, last_n_days):
        date = kwargs['execution_date'] - timedelta(n)
        # my_task is the task object defined within the DAG, not the task_id
        # (as in the example below: check_success_task rather than 'check_success_days_before')
        ti = TaskInstance(*my_task*, date)
        state = ti.current_state()
        if state != 'success':
            raise ValueError('Not all previous tasks successfully completed.')
When you call the function make sure to set provide_context.
check_success_task = PythonOperator(
task_id='check_success_days_before',
python_callable= check_status,
provide_context=True,
dag=dag
)
UPDATE:
When you want to call a task from another dag, you need to call it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is by now also an API call that does the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
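For example (the dag id, task id, and date are placeholders), the returned TaskInstance can then be asked for its state in the same way:

from datetime import datetime
from airflow.api.common.experimental.get_task_instance import get_task_instance

ti = get_task_instance('my_dag_id', 'my_task_id', datetime(2019, 1, 1))
print(ti.current_state())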
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
    dag = get_dag(args)
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    print(ti.current_state())
Hence, it seems you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively, you could execute these CLI operations from within your code using Python's subprocess library.
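A minimal sketch of the subprocess route (the dag id, task id, and date are placeholders; this assumes the CLI prints the state as the last line of its output):

import subprocess

# invoke the Airflow CLI and capture its output
output = subprocess.check_output(
    ['airflow', 'task_state', 'my_dag_id', 'my_task_id', '2019-01-01T00:00:00']
).decode('utf-8')

state = output.strip().splitlines()[-1]
print(state)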
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm on Postgres, but the structure should be the same. Start by grabbing the task_id and state of the task you're interested in with a db call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
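If you'd rather stay in Python, here is a minimal sketch of the same lookup through Airflow's own SQLAlchemy session (assuming Airflow 1.x module paths; the function name is made up):

from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def get_task_state(dag_id, task_id, execution_date, session=None):
    # query the task_instance table through the metadata database session
    ti = (session.query(TaskInstance)
          .filter(TaskInstance.dag_id == dag_id,
                  TaskInstance.task_id == task_id,
                  TaskInstance.execution_date == execution_date)
          .first())
    return ti.state if ti else None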
You can use the command line interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer to the official Airflow documentation:
http://airflow.incubator.apache.org/cli.html