Auto scaling task in Airflow

I want to use Airflow for image processing.
I have 4 tasks: image pre-process (A), bounding box finder (B), classification (C), image finalize (D).
The chart looks like this:
A -> B1 -> C \
-> B2 -> C - D
-> B3 -> C /
-> Bn -> C /
The output of the image pre-process task is a list of bounding box proposals. For each bounding box I run classification, and once all classification tasks end I run the image finalize task.
I want everything to run in parallel.
This will run on 10,000 images per day, so if each image gets a different representation of the pipeline in the UI, I won't be able to keep track of the pipeline.
Is this possible in Airflow?

Dynamically creating tasks like this is not something Airflow is best for. Take a look at the answer here to get some insight: Airflow dynamic tasks at runtime.
Airflow is better suited as a scheduling tool, so I propose you delegate the actual work and parallelization to another tool like Celery. You can still use Airflow to schedule this work, in a way that your B step is a simple operator which reads the output from A (via XCom or similar) and distributes actual work to some remote workers.
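As a rough illustration of that delegation idea, here is a minimal sketch of what the callable behind a single B step might look like. It is not Airflow or Celery code: the thread pool only stands in for remote workers, and `classify`, `fan_out`, `finalize` and the box tuples are all hypothetical names and data made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(box):
    """Placeholder for step C: label one bounding box (made-up rule)."""
    x, y, w, h = box
    return "large" if w * h >= 100 else "small"


def fan_out(boxes, max_workers=4):
    """Steps B and C as one task: classify every proposal in parallel.

    ThreadPoolExecutor stands in for a remote worker pool such as Celery.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so labels[i] belongs to boxes[i]
        return list(pool.map(classify, boxes))


def finalize(boxes, labels):
    """Placeholder for step D: combine the per-box results."""
    return list(zip(boxes, labels))


# Pretend this list is the output of A, e.g. read back from XCom.
boxes = [(0, 0, 5, 5), (10, 10, 20, 20), (3, 3, 10, 10)]
result = finalize(boxes, fan_out(boxes))
print(result)
```

With this shape the DAG stays fixed at A -> B -> D for every image, and the number of bounding boxes only changes how much work happens inside B.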
Can you know in advance the maximum possible number of B tasks? If that's manageable, you could get away with creating the max B tasks, and then skipping some of them as needed depending on the outcome of A.
The implementation might not be trivial, but you could get some hints from this discussion: Launch a subdag with variable parallel tasks in airflow.

Related

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between process) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node, setting the number of threads with an environment variable in the PBS script and using a for loop with Threads.@threads.
I have found references to ClusterManagers.jl, but no clear working example of how to use it on PBS.
For example: does addprocs_pbs in the file take care also of the script part, or do I still need to run a pbs script as usual, and this function is called inside the julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same but parallel process could run across multiple nodes.
using JLD, CSV, Random
include("path/to/library/with/function.jl")
seed = 342
n = 18  # number of simulations
changing_parameter = [1, 2, 3, 4]
input_file = "some file"
CSV.read(string(input_files_folder, input_file));
# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
    Random.seed!(seed)
    seeds = rand(1:100000, n)
    # preallocate one slot per simulation: push! from several threads is not thread-safe
    results = Vector{Any}(undef, n)
    Threads.@threads for i = 1:n
        # my_function stands in for the simulation function from the included file
        results[i] = my_function(some_fixed_parameters, P=P, seed=seeds[i])
    end
    # get the results
    # save the results
    JLD.save(filename, to_save, compress=true)
end
For distributed computing you normally need to use multiprocessing rather than multi-threading (although it is OK to have multi-threaded parallel processes if you need them).
Hence, what you need to do is use the ClusterManagers.jl library, which asks the cluster manager to allocate processes for your Julia cluster.
I have been using Julia on Cray clusters with SLURM, so not exactly PBS; however, since your question remains unanswered, here is my working code. You will use addprocs_pbs, which looks to have a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you have added the worker processes, all that remains is to use the Distributed package to orchestrate your workload.

Is hierarchical parallelism possible with MPI libraries?

I'm writing a computational code with MPI. The software has a few parts, each of which computes a different part of the problem. Each part is written with MPI and so could be run as an independent module. Now I want to combine these parts into one program, so that all parts run in parallel with each other while each part is itself running in parallel.
E.g. total number of nodes = 10, with part 1 running on 6 nodes and part 2 on 4 nodes, both at the same time.
Is there a way to mpirun with 10 nodes and mpi_init each part with the desired number of nodes, without rewriting the overall program to allocate processes for each part of the code?
This is not straightforward.
One option is to use an external program that launches your sub-programs with MPI_Comm_spawn() (called twice). The drawback is that the launcher itself occupies one slot.
Another option needs some rewriting: since all the tasks end up in the same MPI_COMM_WORLD, it is up to them to call MPI_Comm_split() based on who they are, and to use the resulting communicator instead of MPI_COMM_WORLD.
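To make the second option more concrete, here is a small sketch of the rank bookkeeping behind such a split, using the 6 + 4 example from the question. It is plain Python that only simulates the grouping and makes no MPI calls; `color_for` and `simulate_split` are hypothetical helper names, and the rule "the first 6 ranks form part 1" is an assumption. In a real program each rank would pass its color (and a key such as its world rank) to MPI_Comm_split().

```python
def color_for(world_rank, part1_size):
    """Which part a rank belongs to: 0 for part 1, 1 for part 2."""
    return 0 if world_rank < part1_size else 1


def simulate_split(world_size, part1_size):
    """Mimic the grouping MPI_Comm_split would produce.

    Returns {world_rank: (color, rank_in_new_communicator)}. Ranks with the
    same color share a communicator and are ordered by key, which here is
    simply the world rank.
    """
    members = {}
    for world_rank in range(world_size):
        color = color_for(world_rank, part1_size)
        members.setdefault(color, []).append(world_rank)
    return {
        world_rank: (color, new_rank)
        for color, ranks in members.items()
        for new_rank, world_rank in enumerate(sorted(ranks))
    }


layout = simulate_split(world_size=10, part1_size=6)
for world_rank in sorted(layout):
    color, new_rank = layout[world_rank]
    print(f"world rank {world_rank}: part {color + 1}, local rank {new_rank}")
```

Each part then uses its own communicator wherever it previously used MPI_COMM_WORLD, so part 2 sees local ranks 0 to 3 even though its world ranks are 6 to 9.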

nvprof R gputools code never ends

I am trying to run "nvprof" from command line on R. Here is how I am doing it:
./nvprof --print-gpu-trace --devices 0 --analysis-metrics --export-profile /home/xxxxx/%p R
This gives me an R prompt and I write R code. I can do this with Rscript too.
The problem I see is that when I give the --analysis-metrics option, it gives me lots of lines similar to
==44041== Replaying kernel "void ger_kernel(cublasGerParams)"
And the R process never ends. I am not sure what I am missing.
nvprof doesn't modify process exit behavior, so I think you're just suffering from slowness because your app invokes a lot of kernels. You have two options to speed this up.
1. Selectively profiling metrics
The --analysis-metrics option enables collection of a number of metrics, which requires kernels to be replayed - collecting a different set of metrics for each kernel run.
If your application has a lot of kernel invocations, this can take time. I'd suggest you query the available metrics with the nvprof --query-metrics command, and then manually choose the metrics you are interested in.
Once you know which metrics you want, you can query them using nvprof -m metric_1,metric_2,.... This way, the application will profile fewer metrics, hence requiring fewer replays and running faster.
2. Selectively profiling kernels
Alternatively, you can only profile a specific kernel using the --kernels <context id/name>:<stream id/name>:<kernel name>:<invocation> option.
For example, nvprof --kernels ::foo:2 --analysis-metrics ./your_cuda_app will profile all analysis metrics for the kernel whose name contains the string foo, and only on its second invocation. This option takes regular expressions, and is quite powerful.
You can mix and match the above two approaches to speed up profiling. You will be able to find more help about these and other nvprof options using the command nvprof --help.

Best way to pass local variables to ipyparallel cluster

I'm running a simulation in an ipython notebook that is composed of seven functions that are dependent of each other, and requires 13 different parameters. Some of the functions are called within other functions to allow one function to run the entire simulation. The simulation involves manipulating two parameters for a total of >20k iterations. Two simulations can be run asynchronously. Since each iteration is taking ~1.5 seconds, I'm investigating parallel processing.
When I first tried ipyparallel, I got a global name not defined error. It makes sense that local objects can't be found on a worker. In an effort to avoid spending quite a bit of time going down a rabbit hole, what would be the easiest way to pass a whole bunch of objects to all of the workers? Are there other gotchas to consider when using ipyparallel in this way?
There is a bit more detail in this related question, but the gist is: interactively defined modules resolve in the interactive namespace (__main__), which is different on the engine and client. You can send functions to the engine with view.push(dict(func=func, func2=func2)), in which case they will be found. The alternative is to define your functions in a module or package that you ensure is installed on all the engines.
For instance, in a script:
def bar(x):
    return x * x

def foo(y):
    return bar(y)

view.apply(foo, 5)  # NameError on bar
view.push(dict(bar=bar))  # send bar
view.apply(foo, 5)  # 25
Often when using IPython parallel from a notebook or larger script, one of the early steps is seeding the namespace of the engines:
rc[:].push(dict(
    f1=f1,
    f2=f2,
    const=const,
))
If you have more than a few names to push this way, it might be time to consider defining these functions in a module, and distributing that instead.
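The name-resolution behaviour described above can be reproduced locally without a cluster. The following is only a sketch of the mechanism, not of ipyparallel itself: a plain dict (`engine_ns`, a name made up for this example) plays the role of the engine's `__main__` namespace, and rebuilding a function with `types.FunctionType` roughly imitates shipping just the function's code to an engine.

```python
import types


def bar(x):
    return x * x


def foo(y):
    return bar(y)


# A fresh dict plays the role of the engine's __main__ namespace.
engine_ns = {}

# Rebuild foo so that it looks up global names in the engine namespace.
remote_foo = types.FunctionType(foo.__code__, engine_ns, "foo")

try:
    remote_foo(5)
    missing = None
except NameError as err:
    # bar was never defined in the engine namespace
    missing = str(err)

# Rough equivalent of view.push(dict(bar=bar)): seed the engine namespace.
engine_ns["bar"] = types.FunctionType(bar.__code__, engine_ns, "bar")
value = remote_foo(5)

print(missing)
print(value)
```

The first call fails because `foo` looks up `bar` when it is called, in whatever namespace it runs in; once `bar` is present there, the same call succeeds.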

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather start running N tasks on N workers and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
You have at least 2 possibilities:
As mentioned above, you can use the "mc.cores" parameter of mclapply() or the "mc.affinity" parameter of mcparallel().
On AMD platforms "mc.affinity" is preferred since two cores share the same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task for 2 cores only, it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package. See the changelog to find out when it was introduced.
You can also use an OS tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
This makes your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from "0", whereas the parallel package follows R's indexing, so its first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of processes you want to run at a time. With mc.preschedule=FALSE one job is forked per task, so each time a task finishes and its process exits, a new process is forked for the next available task. (With mc.preschedule=TRUE the tasks are instead divided among the cores up front.)
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
wrapper <- function(f,inx){f(inx)}
output <- mclapply(params,FUN=wrapper,mc.preschedule=FALSE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.

Resources