Airflow terminate EMR cluster

I am using an EMR cluster to run two jobs in parallel. Both of these jobs run on the same cluster. I have set the action_on_failure field to 'CONTINUE' so that if one task fails, the other still runs on the cluster. I want the end task, which is EMRTerminateCluster, to run after both of these tasks have completed, regardless of success or failure.
task1 >> [task2, task3] >> task4
I want my DAG to run in such a way that task4 only starts after task2 and task3.
Is there any way to do this?
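A minimal sketch of one common way to do this, assuming Airflow 2.x with the Amazon provider installed (operator arguments and IDs below are illustrative, not taken from the question): set the terminate task's trigger_rule to 'all_done' so it runs once every direct upstream task has finished, whether it succeeded or failed.

from airflow.providers.amazon.aws.operators.emr import EmrTerminateJobFlowOperator  # module path may differ in older provider versions
from airflow.utils.trigger_rule import TriggerRule

# task1, task2 and task3 are the existing tasks defined elsewhere in this DAG.
task4 = EmrTerminateJobFlowOperator(
    task_id='terminate_emr_cluster',
    job_flow_id='j-XXXXXXXXXXXXX',      # illustrative placeholder; usually templated or pulled from XCom
    aws_conn_id='aws_default',
    trigger_rule=TriggerRule.ALL_DONE,  # run after all upstream tasks finish, regardless of success or failure
)

task1 >> [task2, task3] >> task4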

Related

Airflow retry of multiple tasks as a group

I have a group of tasks that should run as a unit, in the sense that if any task in the group fails, the whole group should be marked as failed.
I want to be able to retry the group when it has failed.
For example, I have a DAG with these tasks:
taskA >> (taskB >> taskC) >> taskD
I want to say that (taskB >> taskC) is a group.
If either taskB or taskC fails, I want to be able to rerun the whole group (taskB >> taskC).
This is a two-part question.
First, in Airflow a downstream task cannot affect the state of an upstream task. Assuming a structure of:
taskA >> taskB >> taskC >> taskD
then if taskB is successful and taskC fails, taskC cannot change the state of taskB to failed.
Second, clearing (rerunning) a TaskGroup is a feature that is currently not available. There is an open feature request for it in the Airflow repo. You can view it at this link.
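For reference, a minimal sketch of how taskB and taskC could be expressed as a TaskGroup in Airflow 2.x (operators and IDs are illustrative); the grouping is organizational and does not by itself provide the group-level retry asked about above:

from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

# Inside the DAG definition; taskA and taskD are assumed to be defined in the same DAG.
with TaskGroup(group_id='group_bc') as group_bc:
    taskB = BashOperator(task_id='taskB', bash_command='echo B')  # placeholder command
    taskC = BashOperator(task_id='taskC', bash_command='echo C')  # placeholder command
    taskB >> taskC

taskA >> group_bc >> taskD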

Apache Airflow: Conditionals running before triggered

I'm having a problem with my DAG. I want to set it up so that if one task fails, another task runs and the entire run doesn't fail.
The code is proprietary so I can't post the code snippet. So sorry!
Task0 >> [Task1, Task2]
Task1 >> Task1a
If Task1 fails, I want task2 to execute. If task1 is successful, I want task1a to execute. My current code for task2 looks like this:
task2 = DummyOperator(
    task_id='task2',
    trigger_rule='one_failed',
    dag=dag,
)
I've been playing around with the trigger_rule but this keeps running before task1. It just runs right away.
Your operator is fine. The dependency is wrong.
Your Task0 >> [Task1, Task2] dependency means that Task1 runs in parallel with Task2, and the trigger_rule='one_failed' of Task2 is checked against its direct upstream tasks. This means the rule is checked against Task0's status, not against Task1's.
To fix your issue you need to change the dependencies to:
Task0 >> Task1 >> Task2
Task1 >> Task1a
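Putting it together, a minimal sketch of the corrected DAG, with DummyOperator placeholders standing in for the proprietary tasks (the DAG name and dates are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # EmptyOperator in newer Airflow versions

with DAG('conditional_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    task0 = DummyOperator(task_id='Task0')
    task1 = DummyOperator(task_id='Task1')
    task1a = DummyOperator(task_id='Task1a')  # default trigger_rule='all_success': runs only if Task1 succeeds
    task2 = DummyOperator(task_id='Task2', trigger_rule='one_failed')  # runs only if Task1 fails

    task0 >> task1 >> [task1a, task2]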

Rerun Airflow DAG from a middle task and continue to run it till the end of all downstream tasks (resume Airflow DAG from any task)

Hi, I am new to Apache Airflow. I have a DAG of dependencies, let's say:
Task A >> Task B >> Task C >> Task D >> Task E
1. Is it possible to run the Airflow DAG from a middle task, let's say Task C?
2. Is it possible to run only a specific branch in the case of a branching operator in the middle?
3. Is it possible to resume the Airflow DAG from the last failed task?
4. If it is not possible, how do I manage large DAGs and avoid rerunning redundant tasks?
Please provide suggestions on how to implement this if possible.
1. You can't do it manually. If you use a BranchPythonOperator, you can skip tasks up to the task you wish to start with, according to the conditions set in the BranchPythonOperator.
2. Same as 1.
3. Yes. You can clear tasks upstream till the root or downstream till all leaves of the node.
You can do something like:
Task A >> Task B >> Task C >> Task D
Task C >> Task E
Where C is the branch operator.
For example:
from datetime import date

from airflow.operators.python import BranchPythonOperator  # airflow.operators.python_operator in Airflow 1.x

def branch_func():
    if date.today().weekday() == 0:  # Monday
        return 'task id of D'
    else:
        return 'task id of E'

Task_C = BranchPythonOperator(
    task_id='branch_operation',
    python_callable=branch_func,
    dag=dag)
This will be the task sequence on Monday:
Task A >> Task B >> Task C >> Task D
This will be the task sequence on the rest of the week:
Task A >> Task B >> Task C >> Task E

Airflow run specifying incorrect -sd or -SUBDIRECTORY

I have an Airflow process running every day with many DAGs. Today, all of a sudden, none of the DAGs can be run because when Airflow calls airflow run it misspecifies the -sd directory in which to find the DAG.
Here's the example:
[2018-09-26 15:18:10,406] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run daily_execution dag1 2018-09-26T13:17:50.511269 --job_id 1 --raw -sd DAGS_FOLDER/newfoldernewfolder/dags/all_dags.py']
As you can see, right after -sd the subdirectory repeats newfolder twice, when it should only state DAGS_FOLDER/newfolder/dags/all_dags.py.
I tried running the DAG with the same files that were running two days ago (when everything was correct), but I get the same error. I'm guessing that something has changed in the Airflow configuration, but I'm not aware of any changes in airflow.cfg. I've only been managing the UI, and airflow run gets called automatically once I turn the DAG on.
Does anybody have an idea where airflow run might get this directory and how I can update it?

Schedule a job in Unix

I am pretty new to the Unix environment.
I am trying to schedule two tasks on a Unix server. The second task depends on the result of the first, so I want to run the first task, and if there is no error, the second task should run automatically. But if the first task fails, I want to reschedule the first task again after 30 minutes.
I have no idea where to start from.
You don't need cron for this. A simple shell script is all you need:
#!/bin/sh
while :; do            # Loop until the break statement is hit
    if task1; then     # If task1 is successful
        task2          # then run task2
        break          # and we're done.
    else               # otherwise task1 failed
        sleep 1800     # and we wait 30min
    fi                 # repeat
done
Note that task1 must indicate success with an exit status of 0, and failure with nonzero.
As Wumpus sharply observes, this can be simplified to
#!/bin/sh
until task1; do
    sleep 1800
done
task2
