I have a data processing pipeline with several stages, currently using a custom driver and pipeline framework. I'd really like to use something like make, but I would need to use the parallel capabilities for efficiency.
My question is, is there any configuration that could be done to the workers used by make -j?
For example, if the user runs make -j8, I'd like each of the 8 processes to use a slightly different environment.
Currently I have a custom setup using MATLAB, and I know MATLAB's Parallel Computing Toolbox provides worker setup/teardown hooks.
Here's an example:
all: t1 t2 t3 t4 t5 t6 t7 t8 t9
t%:
	$APP args
Where $APP is different for every process spawned by -j.
Why would I want this? In this case, I have an $APP that can't run more than once simultaneously, so I want to create a pool of them: $APP1, $APP2, $APP3, etc. and distribute make jobs to them.
I'm not sure what you have in mind, but make will run a different recipe for each target that it builds, regardless of whether they're run in parallel or not. Each one can have as many differences as you'd like to supply. For example:
all: t1 t2 t3 t4 t5 t6 t7 t8 t9
t%:
	TARGET=$@ run some command
The value of TARGET will be set to the name of the target that's being built.
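If the real goal is a fixed pool of application instances shared out across the parallel jobs, one possible sketch (my own assumptions, not part of the answer above: the instances live at /opt/app1 ... /opt/app4, and the flock utility is available) derives an instance number from the target name and serializes access to each instance with a lock file:

# Hypothetical Makefile sketch: distribute targets across a pool of app instances.
# (Recipe lines must start with a literal tab.)
POOL_SIZE := 4

all: t1 t2 t3 t4 t5 t6 t7 t8 t9

# Map t1 -> instance 1, t2 -> instance 2, ... wrapping around the pool, and take a
# per-instance lock so no instance ever runs twice at once, even under make -j.
t%:
	n=$$(( ($* - 1) % $(POOL_SIZE) + 1 )); \
	flock /tmp/app$$n.lock /opt/app$$n args

With -j8 and a pool of 4, two targets can map to the same instance; the flock call simply makes the second one wait instead of letting the two runs collide.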
I am trying to execute multiple similar tasks for different sets in parallel, but I only want some of them running at once while the other task groups wait for completion. For example, if I have 5 task groups, I want to run 3 of them in parallel and only trigger the others when one of those completes; basically, only 3 should be running in parallel at any time. What's the best way to do that?
Why don't you use ExternalTaskSensor? What it does is keep a task from running until the task it is bound to has finished; you can search for it in the Airflow documentation.
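For illustration only, a minimal sketch of that sensor approach in Airflow 2 (the DAG ids and task ids here are made up): one group's DAG blocks until another group's DAG run for the same logical date has finished.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG("group_4", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    # Block until the whole "group_1" DAG run for the same execution date is done.
    wait_for_group_1 = ExternalTaskSensor(
        task_id="wait_for_group_1",
        external_dag_id="group_1",
        external_task_id=None,      # None = wait for the entire DAG run
        mode="reschedule",          # frees the worker slot while waiting
    )
    run_group_4 = DummyOperator(task_id="run_group_4")
    wait_for_group_1 >> run_group_4

Another common way to cap concurrency at 3, for what it's worth, is to put all the groups' tasks into a shared Airflow pool with 3 slots.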
We are using Airflow 2.0. I am trying to implement a DAG that does two things:
Trigger Reports via API
Download reports from source to destination.
There needs to be at least a 2-3 hour gap between tasks 1 and 2. From my research I found two options:
Two DAGs for the two tasks, scheduling the 2nd DAG two hours apart from the 1st DAG
A delay between the two tasks, as mentioned here
Is there a preference between the two options? Is there a 3rd option with Airflow 2.0? Please advise.
The first two options are different implementations of a fixed waiting time, and there are two problems with that: 1. What if the report is still not ready after the predefined time? 2. Unnecessary waiting if the report is ready earlier.
A third option would be to use a sensor between the two tasks that waits for the report to become ready: an off-the-shelf sensor if one exists for your source, or a custom one that subclasses the base sensor. You can utilise the sensor's reschedule mode to free up worker slots while it waits.
generate_report = GenerateOperator(...)
wait_for_report = WaitForReportSensor(mode='reschedule', poke_interval=5 * 60, ...)
download_report = DownloadReportOperator(...)
generate_report >> wait_for_report >> download_report
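If no off-the-shelf sensor fits your source, a rough sketch of the custom route (subclassing Airflow's base sensor class; the marker-file check in poke() is just a placeholder for whatever API tells you the report is ready) could look like this:

import os
from airflow.sensors.base import BaseSensorOperator

class WaitForReportSensor(BaseSensorOperator):
    def __init__(self, report_id, **kwargs):
        super().__init__(**kwargs)
        self.report_id = report_id

    def poke(self, context):
        # Called every poke_interval; the task succeeds as soon as this returns True.
        # Placeholder check: replace with your reporting system's "is it ready?" call.
        return os.path.exists(f"/data/reports/{self.report_id}.done")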
I know that priority_weight can be set for the DAG in the default_args as per the example in the official documentation here.
Can we also set priority_weight that is different for each task in the DAG?
Following the example from the tutorial, it would mean that t1 would have a different priority from t2.
Can we also set priority_weight that is different for each task in the DAG?
Short Answer
Yes
Long Version
You appear a little confused here. Citing the passage above the snippet in the given link:
..we have the choice to explicitly pass a set of arguments to each task's constructor (which would become redundant), or (better!) we can define a dictionary of default parameters that we can use when creating tasks..
So by now you must have inferred that the priority_weight being passed in default_args was actually meant for the individual tasks and not the DAG itself. Looking at the code also makes it clear that it's a parameter of BaseOperator and not of the DAG SQLAlchemy model.
Once you know that, you'll also realize that it wouldn't make much sense to assign the same priority to every task of a DAG. The example from the official docs indeed appears to have overlooked this simple reasoning (unless I'm missing something). Nevertheless, the docstring does seem to indicate so:
:param priority_weight: priority weight of this task against other task.
This allows the executor to trigger higher priority tasks before
others when things get backed up.
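Concretely, you can pass a different priority_weight to each operator. A minimal sketch (Airflow 2 style imports; the DAG id and commands are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("priority_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # Each task gets its own priority_weight, overriding anything set in default_args.
    t1 = BashOperator(task_id="t1", bash_command="echo t1", priority_weight=10)
    t2 = BashOperator(task_id="t2", bash_command="echo t2", priority_weight=1)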
UPDATE-1
As rightly pointed out by @Alessandro S. in the comments, assigning the same priority_weight to all tasks within a DAG is NOT unreasonable after all, since priority_weight is not enforced at the DAG level but at the pool level.
So when you take 2 (or more) DAGs into the picture (both accessing the same external resource), a valid use case could be that you want to promote all tasks of one DAG over the other.
To realize this, all tasks of the first DAG can be given a single priority_weight value that is higher than that of the tasks in the second DAG, as sketched below.
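A sketch of that use case, assuming a shared pool named etl_pool (a made-up name; the pool must exist in Airflow) and Airflow 2 imports:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Every task of dag_high outranks every task of dag_low whenever etl_pool is saturated.
common = {"start_date": datetime(2021, 1, 1), "schedule_interval": "@hourly"}

with DAG("dag_high", default_args={"pool": "etl_pool", "priority_weight": 10}, **common) as dag_high:
    BashOperator(task_id="work", bash_command="echo high")

with DAG("dag_low", default_args={"pool": "etl_pool", "priority_weight": 1}, **common) as dag_low:
    BashOperator(task_id="work", bash_command="echo low")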
When you have multiple tasks in a DAG with an upstream-downstream flow, the scheduler will by default assign different priorities to different tasks.
t1 >> t2
For the above scenario, t1 will get a higher priority weight than t2. However, t1 and t2 will have the same priority by default in the following case:
[t1, t2]
If you want more sophisticated and customized priority assignments, you should play around with priority_weight and weight_rule.
priority_weight can be used to prioritize all instances of a certain DAG over other DAGs.
weight_rule handles prioritization at the task level.
For example, DAG D1 has two tasks, t1 and t2. t2 depends on parameters acquired during t1, meaning t2 should be a downstream task of t1.
t1 >> t2
However, when there are multiple instances of D1 running at the same time, the scheduler will run all the t1 instances first and then move on to the t2 instances. This is simply because, by default, t1 gets a higher priority weight than t2. But what if you want to prioritize finishing a DAG run, i.e. t2 should run as soon as its corresponding t1 finishes in each DAG instance? Then you can use weight_rule='upstream' to explicitly give t2 a higher priority weight than t1.
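A minimal sketch of that idea (DAG id and commands are made up; weight_rule is a standard task parameter). Setting weight_rule via default_args applies it to both tasks, so t2's effective weight includes t1's and started runs get finished before new ones begin:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "d1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    default_args={"weight_rule": "upstream"},  # each task adds the weights of its upstream ancestors
) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo t1")
    t2 = BashOperator(task_id="t2", bash_command="echo t2")
    t1 >> t2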
More information here
We have multi-operator dags in our airflow implementation.
Let's say dag-a has operators t1, t2, t3 which are set up to run sequentially (i.e. t2 depends on t1, and t3 depends on t2):
task_2.set_upstream(task_1)
task_3.set_upstream(task_2)
We need to ensure that when dag-a is instantiated, all its tasks complete successfully before another instance of the same dag is instantiated (or before the first task of the next dag instance is triggered.)
We have set the following in our dags:
da['depends_on_past'] = True
What is happening right now is that if the instantiated dag does not have any errors, we see the desired effect.
However, let's say dag-a is scheduled to run hourly. On the hour, the dag-a-i1 instance is triggered as scheduled. Then dag-a-i1's task t1 runs successfully, and then t2 starts running and fails. In that scenario we see the dag-a-i1 instance stop as expected. When the next hour comes, the dag-a-i2 instance is triggered, its task t1 starts running and, let's say, completes, and then dag-a-i2 stops, since its t2 cannot run because the previous instance of t2 (for dag-a-i1) has a failed status.
The behavior we need is that the second instance does not get triggered, or, if it does get triggered, that its task t1 does not run either. This is causing problems for us.
Any help is appreciated.
Before I begin to answer, I'm going to lay out a naming convention that differs from the one you presented in your question.
DagA.TimeA.T1 will refer to an instance of a DAG A executing task T1 at time A.
Moving on, I see two potential solutions here.
The first:
Although not particularly pretty, you could add a sensor task to the beginning of your DAG. This sensor should wait for the execution of the final task of the previous run of the same DAG. Something like the following should work:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from datetime import datetime, timedelta

dag = DAG(dag_id="ETL", schedule_interval="@hourly", start_date=datetime(2021, 1, 1))

# Wait for the previous hour's run of this same DAG to have finished its final task.
ensure_prior_success = ExternalTaskSensor(task_id="ensure_prior_success",
    external_dag_id="ETL", external_task_id="final_task",
    execution_delta=timedelta(hours=1), dag=dag)

# ... your actual ETL tasks go here, downstream of ensure_prior_success ...

final_task = DummyOperator(task_id="final_task", dag=dag)
Written this way, if any of the non-sensor tasks fail during a DagA.TimeA run, DagA.TimeB will begin executing its sensor task but will eventually time out.
If you choose to write your DAG in this way, there are a couple things you should be aware of.
If you are planning on performing backfills of this DAG (or if you think you ever may), you should set your DAG's max_active_runs to a low number. The reason is that a large enough backfill could fill the global task queue with sensor tasks and create a situation where new tasks are unable to be queued.
The first run of this DAG will require human intervention. The human will need to mark the initial sensor task as success (because no previous runs exist, the sensor cannot complete successfully).
The second:
I'm not sure what work your tasks are performing, but for sake of example let's say they involve writes to a database. Create an operator that looks at your database for evidence that DagA.TimeA.T3 completed successfully.
As I said, without knowing what your tasks are doing, it is tough to offer concrete advice on what this operator would look like. If your use-case involves a constant number of database writes, you could perform a query to count the number of documents that exist in the target table WHERE TIME <= NOW - 1 HOUR.
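Without knowing the schema, this is only a loose sketch, but with a recent Airflow you could even use a stock SqlSensor (rather than a fully custom operator) for that check; the connection id, table, and columns below are all made up:

from airflow.sensors.sql import SqlSensor

# Pokes until the query's first cell is truthy, i.e. until the previous run's
# rows are visible in the target table.
previous_run_done = SqlSensor(
    task_id="previous_run_done",
    conn_id="reporting_db",                      # hypothetical connection id
    sql="""
        SELECT COUNT(*) FROM etl_audit
        WHERE finished_at <= NOW() - INTERVAL '1 hour'
          AND status = 'success'
    """,
    mode="reschedule",
    dag=dag,  # the DAG object from the earlier snippet
)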
I'm trying to set up [1] an Autosys job configuration so that it has a "funnel" job-queue behavior or, as I call it, a 'waterdrops' pattern: each job executes in sequence after a given time interval, with a local job failure not cascading into a sequence failure.
[1] Ask for it to be set up, actually, as I do not control the Autosys machine.
Constraints
I have an arbitrary number N of jobs (all executing on success of job A)
For this discussion, let's say three (B1, B2, B3)
Real production numbers might go upward of 100 jobs.
All these jobs won't be created at the same time, so adding a new job should be as painless as possible.
None of those should execute simultaneously.
Not actually a direct problem for our machine
But it has a side effect on a remote client machine: the jobs include file transfers, which are listened for on the client machine, and it doesn't handle simultaneous transfers well.
Adapting the client machine's behavior is, unfortunately, not possible.
Failure of one job is meaningless to the other jobs.
There should be a regular delay in between each job
This is a soft requirement in that, our jobs being batch scripts, we can always append or prepend a sleep command.
I'd rather have a more elegant solution, however, especially if the delay is centralised: a parameter that could be set to greater values, should the need arise.
State of my research
Legend
A(s) : success status of job A
A(d) : done status of job A
Solution 1 : Unfailing sequence
This is the current "we should pick this solution" solution.
A(s) --(delay D)--> B1(d) --(delay D)--> B2(d) --(delay D)--> B3 ...
Pros :
Less bookkeeping than solution 2
Cons :
Bookkeeping of the (current) tail job
The sequence doesn't survive a job being put ON HOLD (ON ICE is fine).
Solution 2 : Stairway parallelism
A(s) ==(delay D)==> B1
A(s) ==(delay D x2)==> B2
A(s) ==(delay D x3)==> B3
...
Pros :
Jobs can be put ON HOLD without consequence.
Cons :
Bookkeeping to know "who runs when" (and what the next delay to implement is)
All N jobs are launched from the same trigger, offset only by their delays
An underlying race condition is created
++ Risk of overlapping job executions, especially if small delays accumulate
Solution 3 : The Miracle Box ?
I have read a bit about Job Boxes, but the specific details elude me.
-----------------
A(s) ====> | B1, B2, B3 |
-----------------
Can we limit the number of concurrent executions of jobs in a box (i.e. a box-local max_load, if I understand that parameter) ?
Pros :
Adding jobs would be painless
Little to no bookkeeping (just the box name for adding new jobs - and it's constant)
Jobs can be put ON HOLD without consequence (unless I'm mistaken)
Cons :
I'm half-convinced it can't be done (but that's why I'm asking you :) )
... any other problem I have failed to foresee
My questions to SO
Is Solution 3 a possibility, and if yes, what are the specific commands and parameters for implementing it ?
Am I correct in favoring Solution 1 over Solution 2 otherwise [2] ?
An alternative solution fitting in the constraints is of course more than welcome!
Thanks in advance,
Best regards
PS: By the way, is all of this a giant race-condition manager for the remote machine's failing behavior ?
Yes, it is.
[2] I'm aware it skirts a bit toward the "subjective" part of the question-rejection rules, but I'm asking it in regard to the solutions' correctness against my (arguably) objective constraints.
I would suggest you do the following:
Put all the jobs (B1,B2,B3) in a box job B.
Create another job (say M1) which would run on success of A. This job will call a shell/perl script (say forcejobs.sh)
The shell script will get a list of all the jobs in B and start a loop; inside the loop it force-starts the jobs one by one, sleeping for the delay period between each.
So the outline of the script would be:
get all the jobs in box B
for each job:
    force start the job
    sleep for the delay interval
At the end of the loop, when all jobs have been force-started, you can use a polling loop to keep checking the status of the jobs. Once all jobs are SU/FA (or whatever terminal status you expect), you can end the script, send the result to yourself/stdout, and finish job M1.
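A rough sketch of what forcejobs.sh could look like (the job list, delay value, and status handling are assumptions; sendevent is the standard Autosys client command, but check the flags against your version):

#!/bin/sh
# Force-start every job in box B one at a time, with a fixed delay between them.
DELAY=300                 # seconds between force-starts; the centralised "D" parameter
JOBS="B1 B2 B3"           # jobs inside box B; could also be derived from autorep output

for job in $JOBS; do
    sendevent -E FORCE_STARTJOB -J "$job"
    sleep "$DELAY"
done

# Optionally poll the job statuses (e.g. via autorep) until everything is SU/FA
# before exiting, so that M1 itself only completes once the whole funnel has drained.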