Airflow retry of multiple tasks as a group

I have a group of tasks that should run as a unit, in the sense that if any task in the group fails, the whole group should be marked as failed.
I want to be able to retry the group when it has failed.
For example, I have a DAG with these tasks:
taskA >> (taskB >> taskC) >> taskD
I want to say that (taskB >> taskC) is a group.
If either taskB or taskC fails, I want to be able to rerun the whole group (taskB >> taskC).

This is a two-part question.
First, in Airflow a downstream task cannot affect the state of an upstream task. Assuming the structure:
taskA >> taskB >> taskC >> taskD
if taskB succeeds and taskC fails, that cannot change the state of taskB to failed.
Second, clearing (rerunning) a TaskGroup is a feature that is currently not available. There is an open feature request for it in the Airflow repository.
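For reference, in Airflow 2.x the group itself can at least be expressed with TaskGroup. The sketch below is only a minimal illustration (the DAG name, dates and EmptyOperator usage are placeholders/assumptions); grouping here is organizational only, and clearing the group as one unit is still the missing feature described above:
# Minimal sketch, Airflow 2.x assumed; EmptyOperator needs Airflow 2.3+
# (use DummyOperator on older versions). All names and dates are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG("grouped_tasks_example", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    taskA = EmptyOperator(task_id="taskA")
    taskD = EmptyOperator(task_id="taskD")

    # taskB >> taskC grouped together; clearing the group as a single unit
    # still has to be done task by task (or via "clear with downstream").
    with TaskGroup(group_id="group_BC") as group_BC:
        taskB = EmptyOperator(task_id="taskB")
        taskC = EmptyOperator(task_id="taskC")
        taskB >> taskC

    taskA >> group_BC >> taskD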

Related

How to re-run all failed tasks in Apache Airflow?

I have an Apache Airflow DAG with tens of thousands of tasks, and after a run, say, a handful of them failed.
I fixed the bug that caused some tasks to fail, and I would like to re-run ONLY the FAILED tasks.
This SO post suggests using the GUI to "clear" failed tasks:
How to restart a failed task on Airflow
That approach works if you only have a handful of failed tasks.
I am wondering if we can bypass the GUI and do it programmatically, through the command line, with something like:
airflow_clear_failed_tasks dag_id execution_date
Use the following command to clear only failed tasks:
airflow clear [-s START_DATE] [-e END_DATE] --only_failed dag_id
Documentation: https://airflow.readthedocs.io/en/stable/cli.html#clear
The command to clear only failed tasks was updated. It is now (Airflow 2.0 as of March 2021):
airflow tasks clear [-s START_DATE] [-e END_DATE] --only-failed dag_id
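If you want to skip the CLI entirely, roughly the same thing can be done from Python. This is a hedged sketch assuming Airflow 2.x, where DAG.clear() accepts only_failed, start_date and end_date; the dag id and dates are placeholders:
# Hedged sketch, Airflow 2.x assumed; "my_dag_id" and the dates are placeholders.
from datetime import datetime

from airflow.models import DagBag

dag = DagBag().get_dag("my_dag_id")
dag.clear(
    start_date=datetime(2021, 3, 1),
    end_date=datetime(2021, 3, 2),
    only_failed=True,  # clear only task instances that failed
)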

Rerun Airflow DAG from a middle task and continue to run it till the end of all downstream tasks (resume an Airflow DAG from any task)

Hi, I am new to Apache Airflow. I have a DAG of dependencies, let's say:
Task A >> Task B >> Task C >> Task D >> Task E
1. Is it possible to run an Airflow DAG from a middle task, let's say Task C?
2. Is it possible to run only a specific branch in case of a branching operator in the middle?
3. Is it possible to resume an Airflow DAG from the last failed task?
4. If not possible, how do you manage large DAGs and avoid rerunning redundant tasks?
Please provide suggestions on how to implement this if possible.
1. You can't do it manually. If you use a BranchPythonOperator, you can skip tasks up to the task you wish to start from, according to the conditions set in the BranchPythonOperator.
2. Same as 1.
3. Yes. You can clear a task together with everything upstream till the root or downstream till all the leaves of that node (see the sketch after the branching example below).
You can do something like:
Task A >> Task B >> Task C >> Task D
Task C >> Task E
Where C is the branch operator.
For example:
from datetime import date

from airflow.operators.python import BranchPythonOperator  # airflow.operators.python_operator on Airflow 1.10

def branch_func():
    # Follow the D branch on Mondays, the E branch otherwise
    if date.today().weekday() == 0:
        return 'Task_D'  # task_id of Task D
    else:
        return 'Task_E'  # task_id of Task E

Task_C = BranchPythonOperator(
    task_id='branch_operation',
    python_callable=branch_func,
    dag=dag)
This will be the task sequence on Monday:
Task A >> Task B >> Task C >> Task D
This will be the task sequence for the rest of the week:
Task A >> Task B >> Task C >> Task E
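As a rough illustration of point 3 (not an official recipe), clearing Task C and the tasks after it can also be scripted. This assumes Airflow 2.x, where DAG.clear() accepts a list of task ids; the dag id, task ids and dates below are placeholders:
# Rough sketch, Airflow 2.x assumed; dag id, task ids and dates are placeholders.
from datetime import datetime

from airflow.models import DagBag

dag = DagBag().get_dag("my_dag_id")
# Clear Task C plus its downstream tasks so the scheduler re-runs them;
# already-successful Task A and Task B are left untouched.
dag.clear(
    task_ids=["Task_C", "Task_D", "Task_E"],
    start_date=datetime(2021, 3, 1),
    end_date=datetime(2021, 3, 1),
)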

Can Snakemake work if a rule's shell command is a cluster job?

In the example below, if the shell script shell_script.sh sends a job to the cluster, is it possible to make snakemake aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh, which sends its own job to the cluster, and then, once that cluster job is completed, file b should be created.
For simplicity, let's assume that snakemake is run locally, meaning that the only cluster job originates from shell_script.sh and not from snakemake.
localrules: that_job

rule all:
    input:
        "output_from_shell_script.txt",
        "file_after_cluster_job.txt"

rule that_job:
    output:
        a = "output_from_shell_script.txt",
        b = "file_after_cluster_job.txt"
    shell:
        """
        shell_script.sh {output.a}
        touch {output.b}
        """
PS - At the moment, I am using a sleep command to give it some waiting time before the job is "completed". But this is an awful workaround, as it could give rise to several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use snakemake on an SGE-managed cluster:
The template which will encapsulate the jobs, which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
Then I run, directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster tells snakemake which queuing system to use
--jobscript is the template in which the jobs will be encapsulated
--latency-wait is important if the file system takes a bit of time to write the files. Your job might end and return before the outputs of the rules are actually visible to the filesystem, which would cause an error
Note that you can specify rules not to be executed on the nodes in the Snakefile with the keyword localrules:
Otherwise, depending on your queuing system, some options exist to wait for jobs sent to the cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete

Schedule a job in Unix

I am pretty new to the Unix environment.
I am trying to schedule two tasks on a Unix server. The second task depends on the result of the first one, so I want to run the first task, and if there is no error, the second task should run automatically. But if the first task fails, I want to reschedule the first task after 30 minutes.
I have no idea where to start from.
You don't need cron for this. A simple shell script is all you need:
#!/bin/sh
while :; do # Loop until the break statement is hit
if task1; then # If task1 is successful
task2 # then run task2
break # and we're done.
else # otherwise task1 failed
sleep 1800 # and we wait 30min
fi # repeat
done
Note that task1 must indicate success with an exit status of 0 and failure with a nonzero status.
As Wumpus sharply observes, this can be simplified to:
#!/bin/sh
until task1; do
sleep 1800
done
task2

Shell Script to Check for Status of a informatica workflow

We have two Informatica jobs that run in parallel.
One starts at 11:40 CET and has around 300 Informatica workflows in it, one of which is fact_sales.
The other job runs at 3:40 CET and has around 115 workflows in it, many of which depend on fact_sales in terms of data consistency.
The problem is that fact_sales should finish before certain workflows in process 2 start for the data to be accurate, but this generally doesn't happen.
What we are trying to do is to split process 2 in such a way that the fact_sales-dependent workflows run only after fact_sales has finished.
Can you suggest a way to write a Unix shell script that checks the status of fact_sales and, if it is successful, kicks off the other dependent workflows, and if not, sends a failure mail?
Thanks
I don't see the need to write a custom shell script for this. Most of this is pretty standard/common functionality that can be implemented using a Command task and Event Wait tasks.
Process 1 - runs at 11:50:
....workflow
...
fact_sales workflow -- add a Command task at the end that drops a flag file, say, fact_sales_0430.done
...
....workflow..500
All the dependent processes will then have an Event Wait that waits on this .done file. Since there are multiple dependent workflows, make sure none of them deletes the file right away. You can drop this .done file at the end of the day or when the load starts for the next day.
workflow1
.....
dependantworkflow1 -- Event Wait, waiting on fact_sales_0430.done (do not delete the file)
dependantworkflow2 -- Event Wait, waiting on fact_sales_0430.done (do not delete the file)
someOtherWorkflow
dependantworkflow3 -- Event Wait, waiting on fact_sales_0430.done (do not delete the file)
....
......
A second approach can be as follows:
You must be running some kind of scheduler to launch these workflows, since Informatica can't schedule multiple workflows as a set; it can only handle worklet/session-level dependency management.
From the scheduler, create a dependency between the sales fact load workflow and the other dependent workflows.
I think the script below will work for you. Please update the parameters ($WORKFLOW_NAME is the workflow to wait for, fact_sales here, and $DEPENDENT_WORKFLOW_NAME is the workflow to kick off once it has succeeded).
WAIT_LOOP=1
while [ ${WAIT_LOOP} -eq 1 ]
do
WF_STATUS=`pmcmd getworkflowdetails -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME $WORKFLOW_NAME | grep "Workflow run status:" | cut -d'[' -f2 | cut -d']' -f1`
echo ${WF_STATUS} | tee -a $LOG_FILE_NAME
case "${WF_STATUS}" in
Aborted)
WAIT_LOOP=0
;;
Disabled)
WAIT_LOOP=0
;;
Failed)
WAIT_LOOP=0
;;
Scheduled)
WAIT_LOOP=0
;;
Stopped)
WAIT_LOOP=0
;;
Succeeded)
WAIT_LOOP=0
;;
Suspended)
WAIT_LOOP=0
;;
Terminated)
WAIT_LOOP=0
;;
Unscheduled)
WAIT_LOOP=0
;;
esac
if [ ${WAIT_LOOP} -eq 1 ]
then
sleep $WAIT_SECONDS
fi
done
if [ "${WF_STATUS}" = "Succeeded" ]
then
pmcmd startworkflow -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME -paramfile $PARAMETER_FILE $DEPENDENT_WORKFLOW_NAME | tee $LOG_FILE_NAME
else
(echo "Please find attached Logs for Run" ; uuencode $LOG_FILE_NAME $LOG_FILE_NAME )| mailx -s "Execution logs" $EMAIL_LIST
exit 1
fi
I can see your main challenge: keeping dependencies between a large number of Informatica workflows.
You have two options.
First, you can use an automated scheduling tool to set the dependencies and run the workflows one by one in the proper order. There are many free tools; which one you choose depends on your comfort, time, cost, etc.
Secondly, you can create your own custom job scheduler. I built a similar scheduler using a UNIX script and an Oracle table. The steps were:
- Categorize all your workflows into groups: independent workflows go into group 1, workflows that depend on group 1 go into group 2, and so on.
- Set up a process that picks workflows one by one from these groups and kicks them off; if the kick-off queue is empty it should wait. Call this loop 2.
- Keep a polling loop that checks the status of the kicked-off workflows. If one failed or was aborted, fail the process, mail the user, and mark all in-queue/dependent workflows as failed. If it is still running, keep polling. If it succeeded, hand control back to loop 2.
- If the kick-off queue is empty, move on to the next group only when every workflow in the current group has succeeded.
This is a bit of a tricky process, but it pays off once you set it up. You can add as many workflows as you want, and maintenance is much smoother compared to the Informatica scheduler or worklets. A rough sketch of the kick-off loop is shown below.
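The sketch below only illustrates that kick-off loop (it is not the original UNIX/Oracle implementation); the workflow names, connection details and reliance on pmcmd's -wait mode are assumptions, and it runs the workflows of each group sequentially for simplicity rather than polling several at once:
# Rough illustration only; not a production scheduler. All names are placeholders,
# and it assumes pmcmd is on PATH and is run in wait mode (-wait).
import subprocess
import sys

# Group 1: independent workflows; group 2: workflows that depend on group 1; ...
GROUPS = [
    ["wf_dim_loads", "wf_fact_sales"],
    ["wf_fact_sales_dependent_1", "wf_fact_sales_dependent_2"],
]

PMCMD = ["pmcmd", "startworkflow",
         "-sv", "INT_SERVICE", "-d", "DOMAIN",
         "-uv", "INFA_USER_NAME", "-pv", "INFA_PASSWORD",
         "-usd", "Client", "-f", "FOLDER", "-wait"]

def run_workflow(name):
    # -wait makes pmcmd block until the workflow finishes; exit code 0 = success
    return subprocess.call(PMCMD + [name]) == 0

for group in GROUPS:
    for wf in group:
        if not run_workflow(wf):
            # In the scheme above this is where you would also mail the user
            # and mark the remaining queued workflows as failed.
            print(f"{wf} failed; stopping remaining groups", file=sys.stderr)
            sys.exit(1)
    # Only move on to the next group once every workflow in this group succeeded.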
You can also fire a query against the repository database, using views such as REP_SESS_LOG, to check whether fact_sales has succeeded, and only then proceed with the second job.
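A minimal sketch of that idea follows, assuming an Oracle repository reachable via cx_Oracle; the DSN, the view and column names (REP_SESS_LOG, RUN_STATUS_CODE, ACTUAL_START) and the convention that status code 1 means Succeeded are all assumptions to verify against your repository version:
# Minimal sketch; the repository DSN, view and column names are assumptions
# to verify against your Informatica repository (MX views must be enabled).
import cx_Oracle

conn = cx_Oracle.connect("rep_user/rep_password@rep_db")  # placeholder credentials
cur = conn.cursor()
cur.execute(
    """
    SELECT run_status_code
      FROM rep_sess_log
     WHERE workflow_name = :wf
     ORDER BY actual_start DESC
    """,
    wf="wf_fact_sales",  # placeholder workflow name
)
row = cur.fetchone()
# Assumed convention in the MX views: RUN_STATUS_CODE = 1 means Succeeded.
if row is not None and row[0] == 1:
    print("fact_sales succeeded - safe to start the dependent workflows")
else:
    print("fact_sales has not succeeded - send the failure mail instead")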
