Is there any TriggerRule for airflow operator with no Trigger by time (need to - airflow

I have a use case where I need two BigQueryOperator tasks that write to the same destination table: one should run daily, and the other should run only when I trigger it manually.
Below is an illustration of the Tree View:
| task_3rd_adhoc
| task_3rd
|---- task_2nd
|---- task_1st_a
|---- task_1st_b
In the example above, the DAG runs daily. I want the tasks to behave as follows:
1. task_1st_a and task_1st_b run first. Target tables are:
   project.dataset.table_1st_a with _PARTITIONTIME = execution date, and
   project.dataset.table_1st_b with _PARTITIONTIME = execution date.
2. task_2nd runs after task_1st_a and task_1st_b finish. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. Target table is:
   project.dataset.table_2nd with _PARTITIONTIME = execution date.
3. task_3rd runs after task_2nd succeeds. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. Target table is:
   project.dataset.table_3rd with _PARTITIONTIME = D-2 from execution date.
4. task_3rd_adhoc should not run in the daily job. I only need it when I want to backfill table project.dataset.table_3rd. Target table is:
   project.dataset.table_3rd with _PARTITIONTIME = execution_date
However, I still can't find the correct TriggerRule for step #4 above. I tried TriggerRule.DUMMY because I thought it could be used to set no trigger, but task_3rd_adhoc also ran in the daily job when I created the DAG above.
(According to the docs, DUMMY means "dependencies are just for show, trigger at will".)

First of all, you've misunderstood TriggerRule.DUMMY.
Usually, when you wire tasks together as task_a >> task_b, B runs only after A is complete (succeeded or failed, depending on B's trigger_rule).
TriggerRule.DUMMY means that even after wiring tasks A and B together as before, B runs independently of A (run at will). It doesn't mean run at your will; rather, it runs at Airflow's will (Airflow triggers it whenever it feels like). So tasks with the dummy trigger rule will pretty much ALWAYS run, albeit at an unpredictable time.
What you need here (to always have a particular task in the DAG but run it only when manually requested) is a combination of:
- AirflowSkipException
- Variable
Here's roughly how you can do it:
- A Variable holds the command for this task (whether or not it should run). You can, of course, edit this Variable anytime from the UI, thereby controlling whether the task runs in the next DagRun.
- In the operator's code (the execute() method for a custom operator, or just the python_callable in the case of PythonOperator), check the value of the Variable to see whether the task is supposed to run.
- If the Variable says the task is NOT supposed to run, raise an AirflowSkipException so that the task is marked as skipped. Otherwise, it just runs as usual.
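The steps above could be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the helper names, the Variable key `run_task_3rd_adhoc`, and the set of strings treated as "on" are assumptions made for illustration. The fallback exception class is there only so the snippet can be tried without Airflow installed.

```python
try:
    from airflow.exceptions import AirflowSkipException
except ImportError:
    # Stand-in so the sketch can be tried without Airflow installed.
    class AirflowSkipException(Exception):
        pass


def flag_enabled(raw_value):
    """Interpret a Variable value (stored as a string) as on/off."""
    return str(raw_value).strip().lower() in {"1", "true", "yes", "on"}


def run_adhoc_backfill(raw_flag):
    """Body for python_callable / execute(): skip unless the flag is on."""
    if not flag_enabled(raw_flag):
        # Raising this marks the task instance as skipped instead of failed.
        raise AirflowSkipException("ad-hoc backfill not requested")
    return "backfill requested"  # the real BigQuery work would start here


# Inside the task, the flag would be read from an Airflow Variable, e.g.
#   Variable.get("run_task_3rd_adhoc", default_var="false")
# (hypothetical key; `default_var` is the keyword in Airflow 1.10/2.x).
```

With the Variable left at "false", the daily run marks task_3rd_adhoc as skipped; flipping it to "true" in the UI lets the next DagRun execute it.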

Related

run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a DAG where I run a few tasks. But I want to modify it so that the clean step only runs if another DAG, "dag2", is not running at the moment.
Is there any way I can import information about "dag2", check its status, and proceed to the clean step only if it is in the success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are different answers depending on what you want to do:
- If you have two DAGs with the same schedule interval and you want each run of the second DAG to wait for the same run of the first one, you can use an ExternalTaskSensor on the last task of the first DAG.
- If you want to run dag2 after each run of dag1, even if dag1 is triggered manually, you need to update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second DAG to None.
I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
- If you have two DAGs and you don't want them to run at the same time, to avoid a conflict on an external server/service, you can use one of the first two suggestions, or give the tasks of the first DAG a higher priority and use the same pool (with 1 slot) for the tasks that cause the conflict. Note that you will lose parallelism on these tasks.
Hossein's approach is the way people usually go. However, if you want information about any DAG run, you can use Airflow's own functionality to get it. The following approach is good when you do not want to (or are not allowed to) modify the other DAG:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

# Fetch the runs of the other DAG and inspect the last one returned.
dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
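For the original question (run clean only while dag2 is not running), the list returned by DagRun.find could be reduced to a yes/no answer. The sketch below assumes only that each run object has a `state` attribute, as DagRun does; the helper names are hypothetical, and plain stand-in objects replace real DagRun rows so the logic can be tried without an Airflow database.

```python
from types import SimpleNamespace


def state_name(run):
    """Return the run's state as a plain lowercase string."""
    state = run.state
    # Enum-style states expose .value; plain strings are used as they are.
    return str(getattr(state, "value", state)).lower()


def is_dag_running(dag_runs):
    """True if at least one of the given runs is in the running state."""
    return any(state_name(run) == "running" for run in dag_runs)


# Stand-ins for the objects returned by DagRun.find(dag_id=...).
finished = [SimpleNamespace(state="success"), SimpleNamespace(state="failed")]
busy = finished + [SimpleNamespace(state="running")]

print(is_dag_running(finished))  # False
print(is_dag_running(busy))      # True
print(is_dag_running([]))        # False: no runs at all, so nothing is running
```

Unlike indexing with `dag_runs[-1]`, this also handles the case where the other DAG has never run.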

Can I create an adhoc parameterized DAG from scheduler job DAG running once a minute

I'm researching Airflow to see if it is a viable fit for my use case, and it's not clear from the documentation whether it fits this scenario. I'd like to schedule a job workflow per customer based on some very dynamic criteria that don't fall into the standard "CRON" loop of running every X minutes etc. (since running jobs together has some impact).
Customer DB
Customer_id, "CRON" desired interval (best case)
1 , 30 minutes
2 , 2 hours
...
... <<<<<<< thousands of these potential jobs
Every minute I'd like to query the state of the current work in the system as well as real world "sensor" data which changes often (such as load on some DBs, or quotas to other resources, or adhoc priorities, boosting etc.)
When decided, I'd need to create a "DAG" (pipeline) of work per customer which had been deemed worthy of running at this time (since perhaps we want to delay work for the "CRON" given some complicated analysis).
For instance :
Every minute run this test:
    for customer in DB:
        if (shouldRunDagForCustomer(customer)):
            Create a DAG with states ..... and run it
    "Sleep for a minute"
def shouldRunDagForCustomer(...):
    currentStateOfJobs = ....
    situationalAwarenessOfWorld = .... // check all sort of interesting metrics
    if some criteria is met for this customer: return true // run the dag for this customer
    else : return false
From the material I've read, it seems that DAGs are given a specified schedule and are static in their structure. It also seems that DAGs run on all their inputs rather than being generated per input.
It also wasn't clear how the scheduling works if a given DAG hasn't completed but its next schedule time has arrived. Would I potentially have multiple runs of the same pipeline for the same input (bad)? Since I have pipelines whose time to complete varies depending on the customer, the dynamic load of the system, etc., I'd like to manage the scheduling and the generation of the "DAG" myself.
This is possible through a "controller" DAG that is scheduled every minute, which then triggers runs for a "target" DAG when desired conditions are met. Airflow has a good example of this, see example_trigger_controller_dag and example_trigger_target_dag. The controller uses the TriggerDagRunOperator() which is an operator that kicks off a run for any desired DAG.
trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="example_trigger_target_dag",  # Ensure this equals the dag_id of the DAG to trigger
    conf={"message": "Hello World"},
    dag=dag,
)
The target DAG then doesn't need to do anything special, except that it should have schedule_interval=None. Note that on trigger, you can populate a conf dictionary that the target can later consume, in case you want to customize each triggered run.
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: \'{{ dag_run.conf["message"] if dag_run else "" }}\'"',
    dag=dag,
)
Back to your case: your scenario is similar, but it differs from the example in that you won't kick off a DAG every time, and you have multiple target DAGs that you could kick off. This is where the ShortCircuitOperator comes into play. It is a task that runs a Python function you specify, which just needs to return true or false. If it returns true, execution continues to the next downstream task as usual; otherwise it "short circuits" and skips the downstream tasks. It's worth giving example_short_circuit_operator a run if you want to see this demonstrated. With that and dynamic creation of tasks in a for-loop, I think you'll get something like this in your controller DAG:
# Import paths as in Airflow 2.x.
from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# `args` (default task arguments) and `DB` (iterable of customer ids) are assumed to be defined elsewhere.
dag = DAG(
    dag_id='controller_customer_pipeline',
    default_args=args,
    schedule_interval='* * * * *',
)

def shouldRunDagForCustomer(customer, **kwargs):
    currentStateOfJobs = ...  # placeholder: query the state of current work
    situationalAwarenessOfWorld = ...  # placeholder: check all sorts of interesting metrics
    criteria_met = ...  # placeholder: decide whether some criteria is met for this customer
    return bool(criteria_met)  # True: run the DAG for this customer; False: skip it

for customer in DB:
    check_run_conditions = ShortCircuitOperator(
        task_id='check_run_conditions_' + str(customer),
        python_callable=shouldRunDagForCustomer,
        op_args=[customer],
        op_kwargs={},  # extra params if needed
        dag=dag,
    )
    trigger_run = TriggerDagRunOperator(
        task_id='trigger_run_' + str(customer),
        trigger_dag_id='target_customer_pipeline_' + str(customer),  # standardize on DAG ids for per customer DAGs
        conf={"foo": "bar"},  # pass on extra info to triggered DAG if needed
        dag=dag,
    )
    check_run_conditions >> trigger_run
Then your target DAG is just the per customer work.
This is probably not the only way you could implement something like this, but basically yes I think it's viable to implement in Airflow.
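As an illustration of what the callable behind the ShortCircuitOperator might decide, here is a minimal sketch. The criteria (no overlapping run, a load ceiling, and the customer's desired interval), the parameter names, and the 0.8 threshold are all assumptions made for the example; the real checks would come from your own job state and "sensor" data.

```python
from datetime import datetime, timedelta


def should_run_dag_for_customer(now, last_run, interval, already_running,
                                db_load, max_load=0.8):
    """Return True when this customer's pipeline should be triggered now."""
    if already_running:      # never overlap two runs for the same customer
        return False
    if db_load > max_load:   # back off while the shared resource is busy
        return False
    if last_run is None:     # never ran before, so run now
        return True
    return now - last_run >= interval


now = datetime(2024, 1, 1, 12, 0)
half_hour = timedelta(minutes=30)

# Interval elapsed, nothing running, low load.
print(should_run_dag_for_customer(now, now - timedelta(minutes=45), half_hour, False, 0.2))  # True
# Interval not yet elapsed.
print(should_run_dag_for_customer(now, now - timedelta(minutes=10), half_hour, False, 0.2))  # False
# A run for this customer is still in progress.
print(should_run_dag_for_customer(now, now - timedelta(minutes=45), half_hour, True, 0.2))   # False
```

Because the function returns a plain boolean, it fits the ShortCircuitOperator contract described above: a falsy result skips the downstream trigger task for that customer.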

When does an airflow dag definition get evaluated?

Suppose I have an airflow dag file that creates a graph like so...
def get_current_info(filename):
    current_info = {}
    <fill in info in current_info relevant for today's date for given file>
    return current_info
files = [
    get_current_info("file_001"),
    get_current_info("file_002"),
    ....
]
for f in files:
    <some BashOperator bo1 using f's current info dict>
    <some BashOperator bo2 using f's current info dict>
    ....
    bo1 >> bo2
    ....
Since the values in the current_info dict used to define the DAG change periodically (here, daily), I would like to know by what process or schedule the DAG definition gets updated. (I print the current_info values on each run and they appear to be updating, but I'm curious how and when exactly this happens.)
When does an airflow DAG definition get evaluated? Is this referenced anywhere in the docs?
The DAGs are evaluated in every run of the scheduler.
This article describes how the scheduler works and at what stage the DAG files are picked up for evaluation.
After some discussion on the airflow email list, it turns out that airflow builds the DAG for each task when it is run, so each task includes the overhead of building the DAG again (which in my case was very significant).
See more details on this here: https://stackoverflow.com/a/59995882/8236733
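The cost of repeated evaluation can be illustrated without Airflow. The sketch below only simulates re-evaluating a DAG file by calling a definition function several times; the function names and the idea of deferring the lookup into the task's callable are assumptions for illustration, and deferring only helps when the values are needed at run time rather than to shape the DAG itself.

```python
lookups = []  # records every time the (hypothetically expensive) lookup runs


def get_current_info(filename):
    """Stand-in for the question's lookup; real code might hit a DB or disk."""
    lookups.append(filename)
    return {"file": filename}


def define_tasks_eager(filenames):
    """Lookup happens while the DAG file is evaluated (on every parse)."""
    infos = [get_current_info(name) for name in filenames]
    return [lambda info=info: info for info in infos]


def define_tasks_lazy(filenames):
    """Lookup is deferred until the task's callable actually runs."""
    return [lambda name=name: get_current_info(name) for name in filenames]


files = ["file_001", "file_002"]

for _ in range(3):               # simulate three evaluations of the DAG file
    define_tasks_eager(files)
eager_lookups = len(lookups)     # 3 evaluations x 2 files

lookups.clear()
for _ in range(3):
    tasks = define_tasks_lazy(files)
tasks[0]()                       # only an executed task pays for its lookup
lazy_lookups = len(lookups)

print(eager_lookups, lazy_lookups)  # 6 1
```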

Apache Airflow: rerun for tasks with date parameters

I have an hourly shell script job that takes a date and hour as input params. The date and hour are used to construct the input path to fetch data for the logic contained in the job DAG. When a job fails and I need to rerun it (by clicking "Clear" on the failed task node to reset its status and trigger a new run), how can I make sure the date and hour used for the rerun are the same as those of the failed run, since the rerun could happen in a different hour than the original run?
You have 3 options to find the execution date and time of the failed run:
- Hover over the failed task you are going to clear; the tooltip shows a value with the key Run:, which is its execution date and time.
- Click the failed task you are going to clear; the heading of the popup that contains the Clear option reads [taskname] on [executiondatewithtime].
- Open the task log; the first line after the attempt count includes a string of the form Executing <Task([TaskName]): task_id> on [ExecutionDate withTime].
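Clearing a task re-runs the same task instance, so the rerun keeps the execution date shown in the places listed above. A job that derives its input path from that date, rather than from the wall clock, therefore reads the same data on a rerun. A minimal sketch, assuming a hypothetical base path and a `date/hour` directory layout:

```python
from datetime import datetime


def build_input_path(execution_date, base="/data/input"):
    """Build the hourly input path from the run's execution date."""
    return f"{base}/{execution_date:%Y-%m-%d}/{execution_date:%H}"


original_run = datetime(2024, 1, 13, 17, 0)

first_attempt = build_input_path(original_run)
# A rerun started hours later still receives the same execution date.
rerun = build_input_path(original_run)

print(first_attempt)           # /data/input/2024-01-13/17
print(first_attempt == rerun)  # True
```

In a templated command, the date part can come from the `{{ ds }}` macro, so the shell script receives the run's date instead of the current time.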

Manual DAG run set individual task state

I have a DAG without a schedule (it is run manually as needed). It has many tasks. Sometimes I want to 'skip' some initial tasks by changing the task state to SUCCESS manually. Changing task state of a manually executed DAG fails, seemingly because of a bug in parsing the execution_date.
Is there another way to individually set task states for a manually executed DAG?
Example run below. The execution date of the Task is 01-13T17:27:13.130427, and I believe the milliseconds are not being parsed correctly.
Traceback
Traceback (most recent call last):
  File "/opt/conda/envs/jumpman_prod/lib/python3.6/site-packages/airflow/www/views.py", line 2372, in set_task_instance_state
    execution_date = datetime.strptime(execution_date, '%Y-%m-%d %H:%M:%S')
  File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 365, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: ..130427
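The same kind of failure can be reproduced outside Airflow with the format string from the traceback. This is only a sketch: the timestamp below is made up (the year in particular is an assumption, since it is not given above), but it has the same shape, with fractional seconds after the seconds field.

```python
from datetime import datetime

# Hypothetical timestamp in the same shape as the one in the traceback.
raw = "2018-01-13 17:27:13.130427"

try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    error = None
except ValueError as exc:
    error = str(exc)

print(error)  # unconverted data remains: .130427

# Adding %f lets strptime consume the fractional seconds.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
print(parsed.microsecond)  # 130427
```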
It doesn't work from the Task Instances page, but you can do it from another page:
- open the DAG graph view
- select the needed Run and click Go
- select the needed task
- in the popup window, click Mark success
- then confirm.
PS: this applies to Airflow 1.9.
What you may want to do to accomplish this is using branching, which, as the name suggests, allows you to follow different execution paths according to some conditions, just like an if in any programming language.
You can use the BranchPythonOperator (documented here) to attain this goal: the idea is that this operator is configured by a python_callable, a function that outputs the task_id to execute next (which should, of course, be a task which is directly downstream from the BranchPythonOperator itself).
Using branching will set the skipped tasks to the proper state automatically, as mentioned in the documentation:
All other “branches” or directly downstream tasks are marked with a state of skipped so that these paths can’t move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run’s state to be inferred.
The resulting DAG would look something like the following:
[Figure: example branching DAG (source: apache.org)]
Branching is documented here, on the official Apache Airflow documentation.
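A callable for the BranchPythonOperator could look roughly like the sketch below. The task ids `initial_setup` and `main_work`, and the idea of passing a `skip_initial` flag through the run's conf when triggering manually, are assumptions for illustration; a stand-in object replaces the real dag_run so the logic can be tried without Airflow.

```python
from types import SimpleNamespace


def choose_start(skip_initial):
    """Return the task_id that the branch operator should follow."""
    return "main_work" if skip_initial else "initial_setup"


def branch_callable(**context):
    """python_callable for the branch task: read the flag from the run's conf."""
    dag_run = context.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    return choose_start(bool(conf.get("skip_initial", False)))


# Stand-ins for the dag_run object found in the task context.
manual_skip = SimpleNamespace(conf={"skip_initial": True})
plain_run = SimpleNamespace(conf=None)

print(branch_callable(dag_run=manual_skip))  # main_work
print(branch_callable(dag_run=plain_run))    # initial_setup
```

Both returned task ids must belong to tasks directly downstream of the branch task. If `main_work` also sits downstream of `initial_setup`, it will likely need a trigger rule that tolerates a skipped upstream task.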

Resources