I am in the process of converting an existing workflow into Apache Airflow tasks. The workflow is as follows:
1. Run a set of tests.
2. Check the results.
   - If the tests pass, move on to the next set of tests.
   - If the tests fail, log the failure information (time, reason, which test was run) in a database.
3. Send out a message about the failures.
Right now I have a PythonOperator task that kicks off a callable, which executes the set of tests based on information we have stored in MongoDB.
test_info_conn = MongoHook(conn_id='test_selector_mongo')
test_list = test_info_conn.find('test_selector_metadata', None).limit(2)
for count, current_test in enumerate(test_list):
    status_test_task = PythonOperator(
        task_id='status_test_' + str(count),
        python_callable=run_status_tests,
        op_kwargs={'current_test': current_test},
        provide_context=True,
        dag=dag)
This is the point where I am not sure how to proceed. Would I chain another task that has a callable to handle the results, and how do I get the results to that next task in the chain?
For branching the workflow (2nd step), you can employ a BranchPythonOperator.
For sharing (log) information between tasks (2nd to 3rd step), Airflow has XComs (see the xcom_pull() function).
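As a rough sketch of how the two pieces fit together: the test task returns its result (Airflow stores a callable's return value as an XCom), and the branch callable pulls that result and returns the task_id to follow. The names run_tests, choose_next_step, next_test_set and log_failure, and the fields of current_test, are hypothetical, and FakeTaskInstance is only a local stand-in so the logic can be tried without a running Airflow.

```python
# Hypothetical callables; field names and task ids are made up for illustration.

def run_tests(current_test, **context):
    """python_callable for a PythonOperator.

    Airflow stores the return value as an XCom under the key 'return_value'.
    The pass/fail check below is a placeholder for the real test logic.
    """
    passed = current_test["actual"] == current_test["expected"]
    return {
        "test": current_test["name"],
        "passed": passed,
        "reason": None if passed else "unexpected result",
    }


def choose_next_step(ti, upstream_task_id, **context):
    """python_callable for a BranchPythonOperator: returns the task_id to follow."""
    result = ti.xcom_pull(task_ids=upstream_task_id)
    return "next_test_set" if result["passed"] else "log_failure"


class FakeTaskInstance:
    """Local stand-in for Airflow's TaskInstance, only for trying the logic."""

    def __init__(self, xcoms):
        self.xcoms = xcoms

    def xcom_pull(self, task_ids):
        return self.xcoms[task_ids]


failing = run_tests({"name": "ping", "expected": 200, "actual": 500})
ti = FakeTaskInstance({"status_test_0": failing})
branch = choose_next_step(ti, upstream_task_id="status_test_0")
print(branch)

# Wiring inside the DAG file (shown as comments, since it needs Airflow):
#   run = PythonOperator(task_id="status_test_0", python_callable=run_tests,
#                        op_kwargs={"current_test": current_test}, dag=dag)
#   check = BranchPythonOperator(task_id="check_results_0",
#                                python_callable=choose_next_step,
#                                op_kwargs={"upstream_task_id": "status_test_0"},
#                                dag=dag)
#   run >> check >> [next_test_set, log_failure]
```

The task ids returned by the branch callable have to belong to tasks directly downstream of the branch task; the branch that is not chosen is skipped.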
References
Concepts/Branching
Concepts/XCOMs
I want to extract statistics on our Airflow processes using its database. One of these statistics is how many DAG runs are finished smoothly, without any failures and reruns. Doing that using the try_number column of the dag_run table doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.
If I understand correctly, you want to get all DAG runs that never had a failed task in them? You can do this by excluding every run_id that has an entry in the task_fail table:
SELECT *
FROM dag_run
WHERE run_id NOT IN (
    SELECT run_id
    FROM dag_run
    JOIN task_fail USING(run_id)
);
Of course, this would not catch other kinds of intervention, such as an engineer marking a task as successful when it is stuck in a running or deferred state.
One note: as of Airflow 2.5.0 you can add notes to DAG runs and task runs in the UI when manually intervening. Those notes are stored in the tables dag_run_note and task_instance_note.
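To see what the NOT IN query above selects, it can be run against a toy copy of the two tables. This is only a sketch: the schema below is heavily simplified (the real dag_run and task_fail tables have many more columns), the rows are invented, and SQLite stands in for whatever database backs your Airflow installation.

```python
import sqlite3

# In-memory database with a simplified, hypothetical version of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag_run   (run_id TEXT, state TEXT);
    CREATE TABLE task_fail (run_id TEXT, task_id TEXT);

    INSERT INTO dag_run VALUES
        ('run_1', 'success'),
        ('run_2', 'success'),
        ('run_3', 'failed');

    -- run_2 finished successfully, but only after one of its tasks had failed
    INSERT INTO task_fail VALUES
        ('run_2', 'load'),
        ('run_3', 'load');
""")

# Same shape as the query in the answer: keep runs with no task_fail entry.
clean_runs = conn.execute("""
    SELECT run_id
    FROM dag_run
    WHERE run_id NOT IN (
        SELECT run_id
        FROM dag_run
        JOIN task_fail USING(run_id)
    )
    ORDER BY run_id
""").fetchall()

print(clean_runs)
conn.close()
```

In a real metadata database a run_id is only unique within one DAG, so when several DAGs are involved it is probably safer to match on dag_id as well as run_id.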
I'm currently developing an Airflow pipeline that will run for multiple tenants. The DAG will be triggered via the API, and its work is split into batches, which are controlled by a SQL metadata database following some business rules.
Each batch has a batch_id used to control the batches, and it is passed to the DAG's conf via the API. The batch_id combines the creation timestamp with the tenant and filetype. Example: tenant1_20221120123323 ... tenant2_20221120123323. A batch can contain two filetypes (for example purposes), and a DAG is triggered for each filetype (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch_id is then combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline:
process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse]
filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse
Here filetype2_exists is a BranchPythonOperator that verifies whether DAG2 was triggered; if it was, the resulting data from DAG2 is merged with the data from DAG1 before write_new_data_to_warehouse is executed.
Based on this DAG model, there will be one DAG run for each tenant. So, the DAG can have multiple DAG runs running in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs in the same DAG instead of working with dynamic DAGs? With dynamic DAGs, I would end up with process_data_on_spark_tenant1, process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach the hundreds.
Now, consider that filetype2 may or may not be present in the batch, and that I use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one for each tenant). The only idea I have for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) is to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadata database and check whether there is a running dag_run_id that contains the current batch_id and filetype2, and, with a sensor, wait for the DAG run status to be success. Then I could run the read_data_from_filetype2 task. Otherwise, if no dag_run_id with the batch_id and filetype2 is registered in the Airflow metadata database, I can go directly to write_new_data_to_warehouse.
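For concreteness, the naming scheme described above could be sketched like this. The helper names build_run_id and matches_batch are hypothetical, and the sketch covers only the string handling, not the query against the metadata database or the sensor.

```python
from datetime import datetime, timezone


def build_run_id(batch_id, filetype, logical_date):
    """Compose a run id of the form manual__<batch_id>_<filetype>__<ISO timestamp>."""
    return f"manual__{batch_id}_{filetype}__{logical_date.isoformat()}"


def matches_batch(run_id, batch_id, filetype):
    """True if the run id was built for the given batch and filetype."""
    return run_id.startswith(f"manual__{batch_id}_{filetype}__")


run_id = build_run_id(
    "tenant1_20221120123323",
    "filetype2",
    datetime(2022, 11, 19, tzinfo=timezone.utc),
)
print(run_id)
print(matches_batch(run_id, "tenant1_20221120123323", "filetype2"))
```

Matching on the full prefix, including the double underscore after the filetype, avoids confusing tenant1 with tenant10 or filetype2 with filetype20.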
Here's the other question:
Is it good practice to modify the dag_run_id and use it, combined with the Airflow metadata database, to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadata database and keep parallel DAG runs in one single DAG?
Or is there a better approach to this problem?
Thank You.
Let's assume my DAG converts a large data set from CSV format to Parquet. If the DAG fails for some reason while running, is it possible to restore the progress when I rerun it?
It shouldn't start from scratch after I rerun the DAG.
An Airflow DAG is a collection of tasks, organized in a way that reflects their relationships and dependencies. So if you have a DAG with 3 tasks, A -> B -> C, and task C fails, you can rerun just C without rerunning A and B. But if you rerun the whole DAG, you rerun task A and all of its downstream tasks (B and C).
If you want to restore progress within a task, you can do that in your job logic, but this is not related to Airflow; it depends only on the technology you use and the logic you want to implement. For example, if your data set consists of multiple files, you can keep a state store in cloud storage or a database that records the processing state of each file; if a file has already been processed, you skip it and move on to the next one.
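A minimal sketch of such a state store, assuming the data set is a list of file names and using a local JSON file as the store (in practice this would more likely be an object in cloud storage or a database table). The function names and the convert callback are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_state(state_path):
    """Return the set of file names already recorded as processed."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()))
    return set()


def convert_all(files, state_path, convert):
    """Convert each file once; files recorded in the state store are skipped."""
    done = load_state(state_path)
    converted = []
    for name in files:
        if name in done:
            continue  # already handled by an earlier (possibly failed) run
        convert(name)
        done.add(name)
        # Persist after every file so a crash loses at most the current file.
        state_path.write_text(json.dumps(sorted(done)))
        converted.append(name)
    return converted


def flaky_convert(name):
    """Placeholder conversion that fails on one file to simulate a crash."""
    if name == "c.csv":
        raise RuntimeError("simulated failure")


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"
    files = ["a.csv", "b.csv", "c.csv"]
    try:
        convert_all(files, state, flaky_convert)  # first run dies on c.csv
    except RuntimeError:
        pass
    resumed = convert_all(files, state, lambda name: None)  # rerun
    final_state = sorted(load_state(state))

print(resumed)
```

On the rerun only the file that had not been recorded is converted, so the task picks up where the failed run stopped.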
I want to run Airflow DAG in a continuous loop. Below is the dependency of my DAG:
create_dummy_start >> task1 >> task2 >> task3 >> create_dummy_end >> task_email_notify
The requirement is that as soon as the flow reaches create_dummy_end, it should loop back to the first task, i.e. create_dummy_start.
I have tried re-triggering the DAG using the code below:
create_dummy_end = TriggerDagRunOperator(
    task_id='End_Task',
    trigger_dag_id=dag.dag_id,
    dag=dag
)
This re-triggers the DAG, but the previous DAG run also keeps running, so multiple runs end up executing in parallel, which does not meet the requirement.
I am new to Airflow, any inputs would be helpful.
By definition, a DAG is "acyclic" (Directed Acyclic Graph): there are no cycles.
Airflow, in general, works on a "schedule" rather than "continuously", and while you can try (as you did) to trigger a new DAG run manually, this will always be "another DAG run". There is no way to get Airflow into a continuous loop like that within a single DAG run.
You can use other tools for this purpose, which is much closer to streaming than to Airflow's batch processing. For example, you could use Apache Beam; it seems to fit your needs better.
The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem with this is that this is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 Dagruns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means that running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow that we can exploit is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
# Uses the Prefect 0.x/1.x API (Flow context manager, flow.run)
import time
from prefect import task, Flow, Parameter
from threading import Thread

def stream():
    for x in range(10):
        yield x
        time.sleep(1)

@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x

@task
def transform(x):
    return x * 2

@task
def load(y):
    print("Received y: {}".format(y))

with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

# Start one flow run per element of the stream, each in its own thread
for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()
You could change trigger_rule from "all_success" to "all_done", so that a task runs once all of its upstream tasks have finished, whether or not they succeeded:
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
You could also create a branch with trigger_rule set to "one_failed" to handle the failed documents differently (e.g. move them to a "failed" folder and send a notification).
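To make the difference between these rules concrete, here is a toy model of how the three of them evaluate a list of upstream task states. This is not Airflow's own code, only a simplified illustration; in a DAG you would just pass trigger_rule="all_done" or trigger_rule="one_failed" to the operator.

```python
# Simplified model of three Airflow trigger rules (not Airflow's implementation).
FINISHED_STATES = {"success", "failed", "skipped", "upstream_failed"}
FAILED_STATES = {"failed", "upstream_failed"}


def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the states of its upstream tasks."""
    if trigger_rule == "all_success":
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        return all(state in FINISHED_STATES for state in upstream_states)
    if trigger_rule == "one_failed":
        return any(state in FAILED_STATES for state in upstream_states)
    raise ValueError(f"rule not covered by this sketch: {trigger_rule}")


upstream = ["success", "failed", "success"]
for rule in ("all_success", "all_done", "one_failed"):
    print(rule, should_run(rule, upstream))
```

With one failed upstream task, the default all_success rule blocks the downstream task, while all_done lets it proceed and one_failed activates the failure-handling branch.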
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choke points might be:
- executor: use the Celery Executor instead of the Local Executor, for example
- backend database: monitor and tune as necessary (indexes, proper storage, etc.)
- webserver: with thousands of DAG runs and tasks, perhaps only use the webserver for dev/QA environments and not for production, where you have a higher rate of task and DAG run submissions. You could use the CLI instead.
Another approach is scaling out by running multiple Airflow instances - partition documents let's say to ten buckets, and assign each partition's documents to just one Airflow instance.
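One way to do the partitioning, sketched under the assumption that every document has a stable id: hash the id and take it modulo the number of buckets, so that the same document always lands on the same Airflow instance. The function name bucket_for and the id format are hypothetical.

```python
import hashlib


def bucket_for(document_id, n_buckets=10):
    """Map a document id to a stable bucket number in range(n_buckets)."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would not be stable across runs.
    digest = hashlib.sha256(str(document_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


# Group some made-up document ids by bucket.
document_ids = [f"doc-{i}" for i in range(1000)]
buckets = {}
for doc_id in document_ids:
    buckets.setdefault(bucket_for(doc_id), []).append(doc_id)

for bucket in sorted(buckets):
    print(bucket, len(buckets[bucket]))
```

Each Airflow instance would then only query for the documents whose bucket number it owns.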
I'd process the heavier tasks in parallel and feed the successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning one DAG for each record. Something along these lines:
Task 1: read from Mongo, filtering by some timestamp (remember idempotence), and feed the downstream tasks (e.g. via XCom);
Task 2: do stuff in parallel via PythonOperator, or even better via a Kubernetes pod, e.g.:
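import threading
from queue import Queue, Empty

# stuff_done, appropriate_thread_nr, upstream_task, xcom_pull and xcom_push
# are placeholders; in a real task the last two would be ti.xcom_pull / ti.xcom_push.

def thread_fun(ret):
    # Drain the shared queue; get_nowait() avoids blocking forever if another
    # worker takes the last job between an emptiness check and the get.
    while True:
        try:
            job = job_queue.get_nowait()
        except Empty:
            break
        try:
            ret.append(stuff_done(job))
        except Exception:
            pass  # failed jobs are simply not passed downstream
        finally:
            job_queue.task_done()
    return ret

# Create workers and queue
threads = []
ret = []  # a mutable object shared by all workers
job_queue = Queue(maxsize=0)
for thr_nr in range(appropriate_thread_nr):  # assumes an integer thread count
    worker = threading.Thread(
        target=thread_fun,
        args=(ret,)
    )
    worker.daemon = True
    threads.append(worker)

# Populate queue with jobs
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)

# Start threads
for thr in threads:
    thr.start()

# Wait to finish their jobs
for thr in threads:
    thr.join()

xcom_push(ret)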
Task 3: Do more stuff coming from previous task, and so on
We have built a system that queries MongoDB for a list and generates one Python file per item, each containing one DAG (note: giving each DAG its own Python file helps Airflow scheduler efficiency, given its current design). The generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.
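A generator along those lines might look roughly like the sketch below. It is not the system described above, only an illustration: the template contents, the function write_dag_files, the file naming and the item ids are all hypothetical, the MongoDB query is replaced by a hard-coded list, and the generated DAG code assumes Airflow 2.4 or later.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template for one generated DAG file; $item_id is filled in per item.
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process():
    print("processing item $item_id")


with DAG(
    dag_id="process_$item_id",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process)
''')


def write_dag_files(item_ids, dags_folder):
    """Write one DAG file per item id into dags_folder and return the paths."""
    paths = []
    for item_id in item_ids:
        path = Path(dags_folder) / f"process_{item_id}.py"
        path.write_text(DAG_TEMPLATE.substitute(item_id=item_id))
        paths.append(path)
    return paths


# Stand-in for the list that would come from MongoDB.
with tempfile.TemporaryDirectory() as dags_folder:
    written = write_dag_files(["a1", "b2"], dags_folder)
    names = [path.name for path in written]
    sources = [path.read_text() for path in written]

print(names)
```

The item ids end up inside dag_id and the file name, so they should be restricted to characters that are valid in both.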