I am trying to see if Airflow is a good fit for this scenario. At present, I have a DAG that looks for a trigger file on S3, creates an EMR cluster, submits a Spark job, and then deletes the EMR cluster.
My requirement is to convert this into an on-demand run. There will be many users running the export from the application, and for each export run I will have to call this DAG. That means more than one instance of the same DAG will be running at the same time.
I know we can make an API call to trigger a DAG, but I am not sure whether we can run more than one instance of a DAG at the same time. Has anyone had a similar use case?
I am handling this with max_active_runs:
from datetime import timedelta

from airflow import DAG

dag = DAG(
    'dev_clickstream_v1',
    max_active_runs=5,                   # allow up to 5 concurrent runs of this DAG
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    params=PARAMS,
)
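For reference, each export can then be started as its own DAG run through the Airflow 2 stable REST API. A minimal sketch, assuming the API is enabled with basic auth; the URL, credentials, and export IDs below are placeholders:
import requests

# Assumed Airflow 2.x webserver URL and basic-auth credentials (placeholders).
DAG_RUNS_URL = "http://localhost:8080/api/v1/dags/dev_clickstream_v1/dagRuns"
AUTH = ("airflow", "airflow")

# Each export becomes a separate DAG run; up to max_active_runs=5 execute concurrently,
# the rest stay queued until a slot frees up.
for export_id in ["export_001", "export_002", "export_003"]:
    response = requests.post(
        DAG_RUNS_URL,
        auth=AUTH,
        json={
            "dag_run_id": f"manual__{export_id}",  # optional custom run id
            "conf": {"export_id": export_id},      # per-run parameters
        },
    )
    response.raise_for_status()
    print(response.json()["dag_run_id"], response.json()["state"])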
Related
Recently I have been developing an Airflow pipeline that will run for multiple tenants. This DAG is triggered via the API and separated into batches, which are controlled by a metadatabase in SQL following some business rules.
Each batch has a batch_id in order to control the batches, and it is passed to the DAG conf via the API. The batch id is the timestamp of creation combined with the tenant and filetype. Example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and for each filetype a DAG is triggered (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch id is then combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline: process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse] and filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse, where filetype2_exists is a BranchPythonOperator that verifies whether DAG_2 was triggered; if it was, it merges the resulting data from DAG2 with that of DAG1 before executing write_new_data_to_warehouse.
Based on this DAG model, there will be one DAG run for each tenant, so the DAG can have multiple DAG runs executing in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs of the same DAG instead of working with dynamic DAGs? In the dynamic case, I would end up with process_data_on_spark_tenant1,
process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach hundreds.
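For context, the dynamic-DAG alternative referred to here usually looks like a factory loop evaluated at parse time. A minimal sketch, where the tenant list, dag ids, and task are assumptions for illustration (use DummyOperator instead of EmptyOperator on Airflow < 2.3):
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical tenant list; in practice this could come from the SQL metadatabase.
TENANTS = ["tenant1", "tenant2", "tenant3"]

def create_tenant_dag(tenant: str) -> DAG:
    """Build one DAG per tenant, e.g. process_data_on_spark_tenant1."""
    with DAG(
        dag_id=f"process_data_on_spark_{tenant}",
        start_date=datetime(2022, 11, 1),
        schedule_interval=None,  # triggered via the API
    ) as dag:
        EmptyOperator(task_id="process_data_on_spark")
    return dag

# Register each generated DAG at module level so the scheduler picks it up.
for tenant in TENANTS:
    globals()[f"dag_{tenant}"] = create_tenant_dag(tenant)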
Now, consider that filetype2 may or may not be present in the batch, and that I would use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one per tenant). The only idea I have for checking whether DAG2 was triggered for the current batch (i.e., whether filetype2 was present in the batch) is to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadatabase and check whether there is a dag_run_id that contains the current batch_id and filetype2 and, with a sensor, wait for that run's status to become success. Then I could run the read_data_from_filetype2 task. Otherwise, if there is no dag_run_id with the batch_id and filetype2 registered in the Airflow metadatabase, I can proceed directly to write_new_data_to_warehouse.
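As a rough illustration of that idea, the branch callable could look up DAG_2 runs whose run_id embeds the batch id. This is only a sketch; the dag id, run-id format, conf key, and downstream task ids are all assumptions, and the sensor-style wait for a successful state is left out:
from airflow.models import DagRun
from airflow.utils.session import create_session

def filetype2_exists(**context):
    """BranchPythonOperator callable: branch on whether DAG_2 was triggered for this batch."""
    batch_id = context["dag_run"].conf["batch_id"]  # e.g. tenant1_20221120123323
    with create_session() as session:
        matching_run = (
            session.query(DagRun)
            .filter(
                DagRun.dag_id == "DAG_2",                       # assumed dag id
                DagRun.run_id.like(f"%{batch_id}_filetype2%"),  # run_id carries the batch id
            )
            .first()
        )
    # Branch to the merge path only if a matching DAG_2 run exists.
    if matching_run is not None:
        return "read_data_from_filetype2"
    return "write_new_data_to_warehouse"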
Here's the other question:
Is it good practice to modify the dag_run_id and use it in combination with the Airflow metadatabase to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadatabase and keep parallel DAG runs in one single DAG?
Or is there a better approach to this problem?
Thank You.
I have two DAGs in my Airflow scheduler which were working in the past. After needing to rebuild the Docker containers running Airflow, they are now stuck in the queued state. In my case the DAGs are triggered via the REST API, so no actual scheduling is involved.
Since there are quite a few similar posts, I ran through the checklist of this answer from a similar question:
Do you have the airflow scheduler running?
Yes!
Do you have the airflow webserver running?
Yes!
Have you checked that all DAGs you want to run are set to On in the web ui?
Yes, both DAGS are shown in the WebUI and no errors are displayed.
Do all the DAGs you want to run have a start date which is in the past?
Yes, the constructor of both DAGs looks as follows:
dag = DAG(
    dag_id='image_object_detection_dag',
    default_args=args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['helloworld'],
)
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
No, I trigger my DAGs manually via the REST API.
If nothing else works, you can use the web ui to click on the dag, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
Here is the output of what this paragraph is showing me:
What is the best way to find out why the tasks won't exit the queued state and run?
EDIT:
Out of curiosity I tried to trigger the DAG from within the web UI, and now both runs executed (the one triggered from the web UI failed, but that was expected, since no config was set).
I have one DAG that tells another DAG what tasks to create in a specific order.
DAG 1 -> produces a file that has a task order.
This runs every 5 minutes or so to keep this file fresh.
DAG 2 -> runs the tasks.
This runs daily.
How can I pass this data between the two DAGs using Airflow?
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using XComs is that they can only be set during task execution, and once the tasks are created in DAG 2 they're fixed and cannot be changed, correct?
The problem with pushing the file to S3 is that the Airflow instance doesn't have permission to pull from S3, due to security decisions made by a team that I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the first DAG? I would recommend the following workflow:
DAG 1 -> updates the task order and stores it in a YAML or JSON file inside the Airflow environment.
DAG 2 -> reads the file to create the required tasks and runs them daily.
Keep in mind that Airflow constantly re-reads your DAG files to pick up the latest configuration, so no extra step is required.
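A rough sketch of what DAG 2 could look like under that approach; the file path, JSON structure, and bash commands are assumptions for illustration:
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical file written by DAG 1 inside the Airflow environment,
# e.g. {"tasks": [{"id": "extract", "cmd": "echo extract"}, ...]}
TASK_ORDER_FILE = Path("/opt/airflow/config/task_order.json")

with DAG(
    dag_id="dag_2_dynamic_tasks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Read at parse time, so the task list refreshes whenever DAG 1 rewrites the file.
    task_order = json.loads(TASK_ORDER_FILE.read_text())

    previous = None
    for spec in task_order["tasks"]:
        task = BashOperator(task_id=spec["id"], bash_command=spec["cmd"])
        if previous is not None:
            previous >> task  # chain tasks in the order listed in the file
        previous = task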
I have had a similar issue in the past, and it largely depends on your setup.
If you are running Airflow on Kubernetes, this might work (a minimal sketch follows below):
You create a PV (Persistent Volume) and a PVC (Persistent Volume Claim).
You start your application with a KubernetesPodOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.
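Sketch of the KubernetesPodOperator side of this, assuming a PVC named shared-task-data already exists; the namespace, image, paths, and command are placeholders, and the exact import path and volume arguments vary with the cncf.kubernetes provider version:
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Assumed pre-existing PVC that both DAGs mount.
volume = k8s.V1Volume(
    name="shared-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="shared-task-data"),
)
volume_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/shared", read_only=False)

with DAG(dag_id="dag_1_write_to_pvc", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    # DAG 1 side: write the result (e.g. a task-order file) onto the PVC;
    # DAG 2 would mount the same PVC with an identical volume/volume_mount pair.
    write_result = KubernetesPodOperator(
        task_id="write_result_to_pvc",
        name="write-result",
        namespace="airflow",       # placeholder namespace
        image="python:3.10-slim",  # placeholder image
        cmds=["sh", "-c", "echo '{\"tasks\": []}' > /shared/task_order.json"],
        volumes=[volume],
        volume_mounts=[volume_mount],
    )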
I am trying to manage Airflow DAGs (create, execute, etc.) via a Java backend. Currently, after creating a DAG and placing it in Airflow's dags folder, my backend keeps trying to run the DAG, but it can't run until it is picked up by the Airflow scheduler, which can take quite some time if there are many DAGs. I am wondering whether there are any events that Airflow emits which I can tap into to check for new DAGs processed by the scheduler, and then trigger the execute command from my backend. Or is there a way or configuration where Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Is there a way or configuration where Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Yes, one of the parameters that you can define is is_paused_upon_creation.
If you set your DAG as:
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id='tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval="@daily",
    start_date=datetime(2020, 12, 28),
    is_paused_upon_creation=False,  # unpause the DAG as soon as the scheduler registers it
)
The DAG will start as soon as it is picked up by the scheduler (assuming the conditions to run it are met).
I am wondering whether there are any events that Airflow emits which I can tap into to check for new DAGs processed by the scheduler
In Airflow >= 2.0.0 you can use the REST API's "list DAGs" endpoint to get all the DAGs that are in the dagbag.
In any Airflow version you can use this code to list the dag_ids:
from airflow.models import DagBag

print(DagBag().dag_ids)  # dag_ids is a property, not a method
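For the Airflow >= 2.0.0 case, a minimal sketch of polling the stable REST API for parsed DAGs (shown with Python requests for brevity; the URL and basic-auth credentials are placeholders):
import requests

# Assumed Airflow 2.x webserver URL and basic-auth credentials (placeholders).
resp = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("airflow", "airflow"),
    params={"limit": 100},
)
resp.raise_for_status()

# Each entry describes a DAG the scheduler has parsed into the dagbag.
for dag in resp.json()["dags"]:
    print(dag["dag_id"], "paused" if dag["is_paused"] else "active")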
I created 4 SubDAGs within the main DAG, which run on different schedule_intervals. I removed the operation of one SubDAG, but it still appears in Airflow's database. Will that entry in the database execute? Is there a way to delete it from Airflow's database?
The record will persist in the database; however, if the DAG isn't actually present on the scheduler (and on the workers, depending on your executor), it can't be added to the DagBag and won't be run.
Have a look at this simplified version of what the scheduler does:
def _do_dags(self, dagbag, dags, tis_out):
    """
    Iterates over the dags and schedules and processes them
    """
    for dag in dags:
        self.logger.debug("Scheduling {}".format(dag.dag_id))
        dag = dagbag.get_dag(dag.dag_id)
        if not dag:
            continue
        try:
            self.schedule_dag(dag)
            self.process_dag(dag, tis_out)
            self.manage_slas(dag)
        except Exception as e:
            self.logger.exception(e)
The scheduler checks whether the DAG is contained in the DagBag before it does any processing on it. Entries for DAGs are kept in the database to maintain the historical record of which dates have been processed, should you re-add the DAG in the future. But for all intents and purposes, you can treat a missing DAG as a paused DAG.
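If you do want to remove the leftover metadata explicitly, newer Airflow versions (2.x) expose a "delete DAG" endpoint in the stable REST API (there is also an airflow dags delete CLI command); a hedged sketch with placeholder URL, credentials, and dag_id:
import requests

# Assumed Airflow 2.x stable REST API; deletes the stored metadata for a DAG
# whose file has already been removed from the dags folder.
resp = requests.delete(
    "http://localhost:8080/api/v1/dags/my_removed_subdag_id",  # placeholder dag_id
    auth=("airflow", "airflow"),
)
resp.raise_for_status()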