Run DAG with tasks of different intervals - airflow

I have 3 tasks, A, B and C. I want to run task A only once, and then run task B monthly until end_date, then run task C only once to clean up.
This is similar to the question "How to handle different task intervals on a single DAG in airflow?", but that approach is not applicable here.
Thanks for your help

For task A, which is supposed to run only once, you can take inspiration from here.
As far as tasks B & C are concerned, they can be tied to A using a ShortCircuitOperator (as already mentioned in the link you cited):
                  -> B
                 /
A -> ShortCircuit
                 \
                  -> C
Alternatively, you could skip B and C internally by raising an AirflowSkipException.
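A minimal sketch of the ShortCircuitOperator idea, assuming a monthly-scheduled DAG on Airflow 2.x; the DAG id, dates, task ids, and gating conditions below are illustrative, not the original poster's code. Each gate lets its branch proceed only on one specific schedule interval, so A and C effectively run once while B runs every month:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

with DAG(
    dag_id="mixed_interval_dag",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    end_date=datetime(2023, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
) as dag:

    # Gate that is True only for the first scheduled interval.
    only_first_run = ShortCircuitOperator(
        task_id="only_first_run",
        python_callable=lambda ds, **_: ds == "2023-01-01",
    )
    A = BashOperator(task_id="A", bash_command="echo 'one-off setup'")

    # B has no gate, so it runs on every monthly interval until end_date.
    B = BashOperator(task_id="B", bash_command="echo 'monthly work'")

    # Gate that is True only for the last scheduled interval.
    only_last_run = ShortCircuitOperator(
        task_id="only_last_run",
        python_callable=lambda ds, **_: ds == "2023-12-01",
    )
    C = BashOperator(task_id="C", bash_command="echo 'one-off cleanup'")

    only_first_run >> A        # A is skipped on every run except the first
    B >> only_last_run >> C    # C is skipped until the final interval

The same effect could be obtained by raising AirflowSkipException inside A's and C's own callables, which removes the extra gating tasks.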

Related

Using Airflow TaskGroup without setting direct dependencies?

The "mainstream" way to use TG is
t1 >> t2 >> task_group >> t3 ...
But in some cases, I'd like to use TG in a different way:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG(...) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")

    with TaskGroup(group_id="myTG") as tg1:
        tg1 = DummyOperator(task_id="TG1")
        tg2 = DummyOperator(task_id="TG2")
        tg3 = DummyOperator(task_id="TG3")

        ##########################################################
        # setting direct dependencies on the "outer" DAG tasks:  #
        ##########################################################
        tg1.set_upstream(t2)
        tg2.set_upstream(t4)

        # internal TG structure:
        tg1 >> tg3
        tg2 >> tg3

    t1 >> t2
    t3 >> t4
As seen above, the TG is not chained directly to the DAG, but rather the internal tasks have dependencies on the "outer" tasks.
This actually works, and the outcome is as expected (using Airflow 2.1.0).
However, this behavior is not documented anywhere and I couldn't find any evidence that this is supported by design. My fear is that this might be a side effect or undefined behavior, and might break in future releases.
Does anybody know if it's safe to use this method?
An Airflow Task Group is a collection of tasks that represents a part of the DAG (a sub-DAG). This collection has root tasks (the tasks which don't have any upstream task in the same Task Group) and leaf tasks (the tasks which don't have any downstream task in the same Task Group).
When you set an upstream for the Task Group, it is simply set as an upstream for all the root tasks, and when you set a downstream task for the Task Group, it is set as a downstream task for all the leaf tasks (method source code).
Your code is completely safe. You can also leave one task without any upstream and add an upstream task only for another, which is not possible if you set the upstream directly on the Task Group.
You will not find any example in the docs similar to your code, because the main goal of the Task Group was to replace Sub-DAGs: you set dependencies for the whole Task Group and use it as a dependency for other tasks/Task Groups. So your code is not the recommended pattern, but it is 100% safe.
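A small sketch of that root/leaf equivalence, assuming Airflow 2.x; the DAG id and task ids are made up for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="tg_equivalence", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    before = DummyOperator(task_id="before")
    after = DummyOperator(task_id="after")

    with TaskGroup(group_id="tg") as tg:
        root_a = DummyOperator(task_id="root_a")
        root_b = DummyOperator(task_id="root_b")
        leaf = DummyOperator(task_id="leaf")
        [root_a, root_b] >> leaf

    # Group-level wiring: "before" becomes upstream of *both* roots,
    # and "after" becomes downstream of the single leaf.
    before >> tg >> after

    # Task-level wiring (the style in the question) would instead allow
    # e.g. before >> root_a only, leaving root_b without any upstream.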

Is it possible to arrange execution of several tasks in a transaction-like mode?

By transaction-like I mean: either all tasks in a set are successful, or none of them are and they should be retried from the first one.
Consider two operators, A and B, A downloads a file, B reads it and performs some actions.
A successfully executes, but before B comes into play a blackout/corruption occurs, and B cannot process the file, so it fails.
To pass, it needs A to re-download the file, but since A is in the success state, there is no direct way to do that automatically; a human must go and clear A's status.
Now I wonder if there is a known way to clear the statuses of task instances up to a certain task when some task fails.
I know I can use a hook, as in "clear an upstream task in airflow within the dag", but that looks a bit ugly.
This feature is not supported by Airflow, but you can use an on_failure_callback to achieve it:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

def clear_task_A(context):
    execution_date = context["execution_date"]
    clear_task_A = BashOperator(
        task_id='clear_task_A',
        # -s: start date, -t: task_id regex, -d: include downstream tasks, -y: don't ask for confirmation
        bash_command=f'airflow tasks clear -s {execution_date} -t A -d -y <dag_id>'
    )
    return clear_task_A.execute(context=context)

with DAG(...) as dag:
    A = DummyOperator(
        task_id='A'
    )
    B = BashOperator(
        task_id='B',
        bash_command='exit 1',
        on_failure_callback=clear_task_A
    )
    A >> B

Is there a way to make a Denodo 8 VDP scheduler job WAIT() for a certain amount of time?

I want to make a VDP scheduler job in Denodo 8 wait for a certain amount of time. The wait function in the job creation process is not working as expected, so I figured I'd write it into the VQL. However, when I try the function suggested in the documentation (https://community.denodo.com/docs/html/browse/8.0/en/vdp/vql/stored_procedures/predefined_stored_procedures/wait), the Denodo 8 VQL shell doesn't recognize it.
--Not working
SELECT WAIT('10000');
Returns the following error:
Function 'wait' with arity 1 not found
--Not working
WAIT('10000');
Returns the following error:
Error parsing command 'WAIT('10000')'
Any suggestions would be much appreciated.
There are two ways of invoking WAIT:
Option #1
-- Wait for one minute
CALL WAIT(60000);
Option #2:
-- Wait for ten seconds
SELECT timeinmillis
FROM WAIT()
WHERE timeinmillis = 10000;

Using MPI on Slurm, is there a way to send messages between separate jobs?

I'm new to using MPI (mpi4py) and Slurm. I need to run about 50000 tasks, so to obey the administrator-set limit of about 1000, I've been running them like this:
sbrunner.sh:
#!/bin/bash
for i in {1..50}
do
    sbatch m2slurm.sh $i
    sleep 0.1
done
m2slurm.sh:
#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --output=mpi_50000.out
#SBATCH --time=0:10:00
#SBATCH --ntasks=1000
srun --mpi=pmi2 --output=mpi_50k${1}.out python par.py data_50000.pkl ${1} > ${1}py.out 2> ${1}.err
par.py (irrelevant stuff omitted):
offset = (int(sys.argv[2])-1)*1000
comm = MPI.COMM_WORLD
k = comm.Get_rank()
d = data[k+offset]
# ... do something with d ...
allresults = comm.gather(result, root=0)
comm.Barrier()
if k == 0:
    print(allresults)
Is this a sensible way to get around the limit of 1000 tasks?
Is there a better way to consolidate results? I now have 50 files I have to concatenate manually. Is there some comm_world that can exist between different jobs?
I think you need to make your application divide the work among 1000 tasks (MPI ranks) and consolidate the results afterwards with MPI collective calls, i.e. MPI_Reduce or MPI_AllReduce.
Trying to work around the limit won't help you, as the jobs you start will simply be queued one after another.
Job arrays will give behavior similar to what you did in the batch file you provided, so your application must still be able to process all data items given only N tasks (MPI ranks).
There is no need to poll to make sure all the other jobs are finished; take a look at the Slurm job dependency parameter:
https://hpc.nih.gov/docs/job_dependencies.html
Edit:
You can use a job dependency to make a new job that runs after all the other jobs finish; this job can collect the results and merge them into one big file. I still believe you are overthinking the obvious solution: make rank 0 (the master) collect all the results and save them to disk.
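A rough sketch of that rank-0 consolidation, based on the structure of the question's par.py; do_something and the output filename are placeholders, not part of the original code:

import pickle
import sys

from mpi4py import MPI

def do_something(item):
    # placeholder for the question's per-item computation
    return item

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
offset = (int(sys.argv[2]) - 1) * 1000

# Load the shared input, e.g. data_50000.pkl passed as the first argument.
with open(sys.argv[1], "rb") as f:
    data = pickle.load(f)

result = do_something(data[rank + offset])

# Every rank sends its result to rank 0.
allresults = comm.gather(result, root=0)

if rank == 0:
    # Rank 0 writes a single results file for this job,
    # instead of relying on 50 separate stdout files.
    with open(f"results_{sys.argv[2]}.pkl", "wb") as f:
        pickle.dump(allresults, f)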
This looks like a perfect candidate for job arrays. Each job in an array is identical with the exception of a $SLURM_ARRAY_TASK_ID environment variable. You can use this in the same way that you're using the command line variable.
(You'll need to check that MaxArraySize is set high enough by your sysadmin. Check the output of scontrol show config | grep MaxArraySize )
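With a job array, the per-job offset in par.py can be derived from the array index instead of the $1 argument passed by sbrunner.sh; a small sketch, assuming the same 1000-item chunking as in the question and an sbatch submission with --array=1-50:

import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# SLURM_ARRAY_TASK_ID takes the values 1..50 when submitted with --array=1-50.
array_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
offset = (array_id - 1) * 1000

# d = data[rank + offset]   # then continue exactly as in par.py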
What do you mean by 50000 tasks?
1) Do you mean one MPI job with 50000 MPI tasks?
2) Or do you mean 50000 independent serial programs?
3) Or do you mean any combination where (number of MPI jobs) * (number of tasks per job) = 50000?
If 1), consult your system administrator. Of course you can allocate 50000 slots across multiple Slurm jobs, manually wait until they are all running at the same time, and then mpirun your app outside of Slurm, but this is both ugly and inefficient, and you might get in trouble if it is seen as an attempt to circumvent the system limits.
If 2) or 3), then a job array is a good candidate, and if I understand your app correctly, you will need an extra post-processing step to concatenate/merge all your outputs into a single file.
If you go with 3), you will need to find the sweet spot: generally speaking, 50000 serial programs are more efficient than fifty 1000-task MPI jobs or one 50000-task MPI program, but merging 50000 files is less efficient than merging 50 files (or not merging anything at all).
And if you cannot do the post-processing on a frontend node, you can use a job dependency to start it after all the computations have completed.

Opening a new instance of R and sourcing a script within that instance

Background/Motivation:
I am running a bioinformatics pipeline that, if executed linearly from beginning to end, takes several days to finish. Fortunately, some of the tasks don't depend upon each other, so they can be performed individually. For example, Tasks 2, 3, and 4 all depend upon the output from Task 1, but do not need information from each other. Task 5 uses the output of 2, 3, and 4 as input.
I'm trying to write a script that will open new instances of R for each of the three tasks and run them simultaneously. Once all three are complete I can continue with the remaining pipeline.
What I've done in the past, for more linear workflows, is have one "master" script that sources (source()) each task's subscript in turn.
I've scoured SO and google and haven't been able to find a solution for this particular problem. Hopefully you guys can help.
From within R, you can use system() to invoke commands in a terminal and open to open a file. For example, the following will open a new Terminal instance:
system("open -a Terminal .",wait=FALSE)
Similarly, I can start a new R session by using
system("open -a r .")
What I can't figure out for the life of me is how to set the "input" argument so that it sources one of my scripts. For example, I would expect the following to open a new terminal instance, call r within the new instance, and then source the script.
system("open -a Terminal .",wait=FALSE,input=paste0("r; source(\"/path/to/script/M_01-A.R\",verbose=TRUE,max.deparse.length=Inf)"))
Answering my own question in the event someone else is interested down the road.
After a couple of days of working on this, I think the best way to carry out this workflow is to not limit myself to working just in R. Writing a bash script offers more flexibility and is probably a more direct solution. The following example was suggested to me on another website.
#!/bin/bash
# Run task 1
Rscript Task1.R
# now run the three jobs that use Task1's output
# we can fork these using '&' to run in the background in parallel
Rscript Task2.R &
Rscript Task3.R &
Rscript Task4.R &
# wait until background processes have finished
wait %1 %2 %3
Rscript Task5.R
You might be interested in the future package (I'm the author). It allows you to write your code as:
library("future")
v1 %<-% task1(args_1)
v2 %<-% task2(v1, args_2)
v3 %<-% task3(v1, args_3)
v4 %<-% task4(v1, args_4)
v5 %<-% task5(v2, v3, v4, args_5)
Each of those v %<-% expr statements creates a future based on the R expression expr (and all of its dependencies) and assigns it to a promise v. Only when v is used will the code block and wait for the value of v to become available.
How and where these futures are resolved is decided by the user of the above code. For instance, by specifying:
library("future")
plan(multiprocess)
at the top, then the futures (= the different tasks) are resolved in parallel on your local machine. If you use,
plan(cluster, workers = c("n1", "n3", "n3", "n5"))
they're resolved on those four machines (where n3 accepts two concurrent jobs).
This works on all operating systems (including Windows).
If you have access to an HPC cluster with job schedulers such as Slurm, SGE, or TORQUE/PBS, you can use the future.BatchJobs package, e.g.
plan(future.BatchJobs::batchjobs_torque)
PS. One reason for creating future was to do large-scale bioinformatics in parallel / distributed setups.