We have created a task for a sensor operation, but the task name is generated dynamically, i.e., f"{table_name}_s3_exists". We have a scenario where we have to check a table's location twice, but if the task is already present, we don't have to create the sensor again. Is there a way to find out whether a task already exists within the DAG while it is being built?
The CLI command
airflow tasks list [-h] [-S SUBDIR] [-t] [-v] dag_id
will give you a list of all the tasks in the given DAG.
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#list_repeat6
You can also use the REST API to get the same info:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_tasks
You could try the get_tasks endpoint in the Airflow REST API. The endpoint returns a lot of information for tasks in a given DAG.
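If the check needs to happen while the DAG file itself is being parsed (as the question describes), the DAG object can also be queried directly with DAG.has_task(). A minimal sketch, assuming Airflow 1.10, an S3KeySensor, and a hypothetical table list in which one table appears twice:
from datetime import datetime
from airflow import DAG
from airflow.sensors.s3_key_sensor import S3KeySensor

dag = DAG("table_checks", schedule_interval="@daily", start_date=datetime(2021, 1, 1))

tables = ["orders", "customers", "orders"]  # hypothetical; "orders" must be checked twice

for table_name in tables:
    task_id = f"{table_name}_s3_exists"
    # has_task() reports whether this task id is already registered on the DAG,
    # so the second occurrence of "orders" does not create a duplicate sensor
    if not dag.has_task(task_id):
        S3KeySensor(
            task_id=task_id,
            bucket_key=f"s3://my-bucket/{table_name}/",  # hypothetical location
            wildcard_match=True,
            dag=dag,
        )
The same information is available via dag.task_dict, a mapping of task ids to task objects.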
I am in the process of converting a workflow I currently have into Apache Airflow tasks. The workflow is as follows:
1. Run set of tests.
2. Check results
- If the tests pass, then move to the next set of tests.
- If the tests fail, log the failure information (time, reason, which test was run) in a database.
3. Send a message out about the failures.
Right now I have a PythonOperator task going that kicks off a callable that executes the set of tests based off information we have stored in MongoDB.
from airflow.contrib.hooks.mongo_hook import MongoHook
from airflow.operators.python_operator import PythonOperator

test_info_conn = MongoHook(conn_id='test_selector_mongo')
test_list = test_info_conn.find('test_selector_metadata', None).limit(2)

for count, current_test in enumerate(test_list):
    status_test_task = PythonOperator(
        task_id='status_test_' + str(count),
        python_callable=run_status_tests,
        op_kwargs={'current_test': current_test},
        provide_context=True,
        dag=dag)
This is the point where I am not sure how to proceed. Would I chain another task that has a callable to handle the results, and how do I get the results to that next task in the chain?
For branching the workflow (step 2), you can employ a BranchPythonOperator (sketched below).
For sharing (log) information between tasks (steps 2 to 3), Airflow has XComs (the xcom_pull() function).
References
Concepts/Branching
Concepts/XCOMs
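A minimal sketch of how steps 2 and 3 could be wired together, assuming the hypothetical task ids status_test_0, next_test_batch, and log_failure, and assuming run_status_tests returns its results (a returned value automatically becomes an XCom):
from airflow.operators.python_operator import BranchPythonOperator, PythonOperator
from airflow.operators.dummy_operator import DummyOperator

def check_results(**context):
    # Pull whatever run_status_tests returned; it is stored as an XCom
    results = context["ti"].xcom_pull(task_ids="status_test_0")
    if results and results.get("passed"):  # hypothetical result shape
        return "next_test_batch"
    return "log_failure"

check_results_task = BranchPythonOperator(
    task_id="check_results",
    python_callable=check_results,
    provide_context=True,
    dag=dag,
)

def log_failure(**context):
    failure_info = context["ti"].xcom_pull(task_ids="status_test_0")
    # Hypothetical: write time, reason, and test name to the failures database
    pass

log_failure_task = PythonOperator(
    task_id="log_failure",
    python_callable=log_failure,
    provide_context=True,
    dag=dag,
)

next_test_batch = DummyOperator(task_id="next_test_batch", dag=dag)

status_test_task >> check_results_task >> [next_test_batch, log_failure_task]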
Is it possible to somehow extract the task instance object for upstream tasks from the context passed to python_callable in a PythonOperator? The use case is that I would like to check the status of two tasks immediately after branching, to see which one ran and which one was skipped, so that I can query the correct task for its return value via XCom.
Thanks
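One way to approach this (a sketch, assuming two hypothetical branch task ids branch_a and branch_b): the DagRun available in the context can return the task instance, and hence the state, for any task id in the current run:
from airflow.utils.state import State

def pick_branch_result(**context):
    dag_run = context["dag_run"]
    for task_id in ("branch_a", "branch_b"):  # hypothetical upstream ids
        ti = dag_run.get_task_instance(task_id)
        if ti is not None and ti.state == State.SUCCESS:
            # Only the branch that actually ran has a usable return value
            return context["ti"].xcom_pull(task_ids=task_id)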
The current problem I am facing is that I have documents in a MongoDB collection, each of which needs to be processed and updated by tasks that must run in an acyclic dependency graph. If an upstream task fails to process a document, then none of the dependent tasks may process that document, as it has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem is that this is not how Airflow is intended to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 DagRuns per day.
Run a DAG every period to process all documents created in the collection during that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if one document takes longer than the others to be processed by a task, those other documents are left waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means that running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from prefect import task, Flow, Parameter
from threading import Thread

def stream():
    for x in range(10):
        yield x
        time.sleep(1)

@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x

@task
def transform(x):
    return x * 2

@task
def load(y):
    print("Received y: {}".format(y))

with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

# Run one Flow per element in the stream, each in its own thread
for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()
You could change trigger_rule from "all_success" to "all_done"
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
You could also create a branch that handles failed documents, with trigger_rule set to "one_failed", to process those failed documents differently (e.g. move them to a "failed" folder and send a notification).
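A minimal sketch of that failure branch, assuming a hypothetical handle_failed_docs callable and an existing process_documents task; with the one_failed rule, the task fires as soon as any upstream task fails:
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

handle_failures = PythonOperator(
    task_id="handle_failed_docs",
    python_callable=handle_failed_docs,  # hypothetical: move docs to a "failed" folder, notify
    trigger_rule=TriggerRule.ONE_FAILED,
    dag=dag,
)

process_documents >> handle_failures  # hypothetical upstream task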
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Potential choking points:
executor - use the Celery Executor instead of the Local Executor, for example (see the config sketch after this list)
backend database - monitor and tune as necessary (indexes, proper storage, etc.)
webserver - for thousands of DagRuns and tasks, perhaps only use the webserver in dev/qa environments, and not in production where you have a higher rate of task/DagRun submissions; you could use the CLI instead
Another approach is scaling out by running multiple Airflow instances - partition the documents into, say, ten buckets, and assign each partition's documents to just one Airflow instance.
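A sketch of the relevant airflow.cfg entries for the executor switch, assuming a Redis broker and a Postgres metadata database (both hypothetical choices for this setup):
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadata-db:5432/airflow

[celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@metadata-db:5432/airflow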
I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning one DAG per record. Something along these lines:
Task 1: read Mongo, filtering by some timestamp (remember idempotence), and feed the tasks (i.e. via XCom); a sketch follows after this list.
Task 2: do stuff in parallel via PythonOperator, or even better via a K8s Pod, e.g.:
import threading
from queue import Queue

def thread_fun(ret):
    # Drain the shared queue, appending each job's result to the shared list
    while not job_queue.empty():
        job = job_queue.get()
        try:
            ret.append(stuff_done(job))
        except Exception:
            pass
        job_queue.task_done()
    return ret

# Create workers and queue
threads = []
ret = []  # a mutable object shared with the workers
job_queue = Queue(maxsize=0)
for thr_nr in range(appropriate_thread_nr):
    worker = threading.Thread(
        target=thread_fun,
        args=(ret,)
    )
    worker.setDaemon(True)
    threads.append(worker)

# Populate queue with jobs
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)

# Start threads
for thr in threads:
    thr.start()

# Wait for them to finish their jobs
for thr in threads:
    thr.join()

xcom_push(ret)
Task 3: do more stuff coming from the previous task, and so on.
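For Task 1, referenced above, a sketch under the same assumptions (hypothetical connection id, collection, and timestamp field); returning the list from the callable publishes it as an XCom for Task 2:
from airflow.contrib.hooks.mongo_hook import MongoHook

def read_new_docs(**context):
    hook = MongoHook(conn_id="docs_mongo")  # hypothetical connection
    # Filter on the run's execution window so reruns stay idempotent
    query = {"created_at": {"$gte": context["execution_date"]}}
    return list(hook.find("documents", query))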
We have built a system that queries MongoDB for a list and generates one Python file per item, each containing a single DAG (note: giving each DAG its own Python file helps Airflow scheduler efficiency with its current design). The generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.
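A minimal sketch of that generator pattern, assuming a hypothetical Mongo connection, collection, and DAGs folder path; the generator task rewrites one Python file per item:
import os
from airflow.contrib.hooks.mongo_hook import MongoHook

DAG_TEMPLATE = '''\
from datetime import datetime
from airflow import DAG

dag = DAG("process_{doc_id}", schedule_interval="@hourly",
          start_date=datetime(2021, 1, 1), catchup=False)
# ... tasks for this item go here ...
'''

def generate_dag_files(**context):
    hook = MongoHook(conn_id="docs_mongo")  # hypothetical connection
    for doc in hook.find("documents", None):  # hypothetical collection
        path = os.path.join("/opt/airflow/dags", "process_{}.py".format(doc["_id"]))
        with open(path, "w") as f:
            f.write(DAG_TEMPLATE.format(doc_id=doc["_id"]))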
Is it possible to pass parameters to Airflow's jobs through UI?
AFAIK, the 'params' argument of a DAG is defined in Python code, therefore it can't be changed at runtime.
Depending on what you're trying to do, you might be able to leverage Airflow Variables. These can be defined or edited in the UI under the Admin tab. Then your DAG code can read the value of the variable and pass it to the DAG(s) it creates.
Note, however, that although Variables let you decouple values from code, all runs of a DAG will read the same value for the variable. If you want different runs to receive different values, your best bet is probably to use Airflow templating macros and differentiate runs with the run_id macro or similar.
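A minimal sketch of reading such a value, assuming a hypothetical batch_size Variable defined under Admin > Variables:
from airflow.models import Variable

# default_var is used when the Variable has not been defined in the UI
batch_size = int(Variable.get("batch_size", default_var="100"))
In templated operator fields, the same value is also available as {{ var.value.batch_size }}.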
Two ways to change your DAG behavior:
Use Airflow Variables, as mentioned by Bryan in his answer.
Use Airflow JSON conf to pass JSON data to a single DAG run. The JSON can be passed either from the UI:
- manual trigger from the tree view
- create a new DAG run from Browse > DAG Runs > create new record
or from the CLI:
airflow trigger_dag 'MY_DAG' -r 'test-run-1' --conf '{"exec_date":"2021-09-14"}'
Within the DAG, this JSON can be accessed using Jinja templates or through the context param of an operator's callable function.
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def do_some_task(**context):
    # Read the JSON conf supplied when the run was triggered
    print(context['dag_run'].conf['exec_date'])

task1 = PythonOperator(
    task_id='task1_id',
    provide_context=True,
    python_callable=do_some_task,
    dag=dag,
)

# Access the conf in templated fields
task2 = BashOperator(
    task_id="task2_id",
    bash_command="echo {{ dag_run.conf['exec_date'] }}",
    dag=dag,
)
Note that the JSON conf will not be present during scheduled runs. The best use case for JSON conf is overriding default DAG behavior, so set meaningful defaults in the DAG code and scheduled runs will simply not use the JSON conf.
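A sketch of that defaulting pattern, reusing the exec_date key from the example above; the callable falls back to the run's own date when no conf was supplied:
def do_some_task(**context):
    conf = context['dag_run'].conf or {}
    # Fall back to the scheduled run's date when no JSON conf was passed
    exec_date = conf.get('exec_date', context['ds'])
    print(exec_date)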