How to parameterize DataprocSubmitJobOperator in Airflow

In Airflow, when triggering Dataproc Spark jobs, we can pass the parameters the Spark jobs require.
However, I am wondering whether we can pass different parameters depending on the schedule.
For example:
If the day of the month is between 1 and 7, run the Dataproc job with parameter X.
If the day of the month is between 8 and the end of the month, run the Dataproc job with parameter Y.
This would help avoid redundant DAGs.
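
One possible way to express the date rule above is a small helper that picks the parameter from the run date and builds the job definition from it. This is only a sketch: the function names, the `--param` flag, and the project, cluster, class, and jar values are hypothetical placeholders, and the dictionary merely follows the general shape of a Dataproc Spark job definition.

```python
from datetime import date


def spark_args_for(run_date):
    """Pick the Spark job arguments from the day of the month."""
    # "X" and "Y" stand in for the two parameter sets from the question.
    param = "X" if 1 <= run_date.day <= 7 else "Y"
    # "--param" is a hypothetical flag name, not something Dataproc defines.
    return ["--param", param]


def build_job(run_date):
    """Build a Dataproc-style job dict for the given run date."""
    return {
        # All identifiers below are placeholders for illustration only.
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            "main_class": "com.example.Main",
            "jar_file_uris": ["gs://my-bucket/app.jar"],
            "args": spark_args_for(run_date),
        },
    }


if __name__ == "__main__":
    print(build_job(date(2022, 11, 3))["spark_job"]["args"])
    print(build_job(date(2022, 11, 20))["spark_job"]["args"])
```

Because `job` is a templated field of DataprocSubmitJobOperator, the same choice could probably also be written inline as a Jinja expression in the job arguments, for example `"{{ 'X' if logical_date.day <= 7 else 'Y' }}"` (assuming an Airflow version that exposes `logical_date` in the template context), so a single DAG covers both cases.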

Related

Is it good practice to use the Airflow metadata database to control pipelines?

Recently I have been developing an Airflow pipeline that will run for multiple tenants. The DAG will be triggered via the API and split into batches, which are controlled by a metadata database in SQL following some business rules.
Each batch has a batch_id used to control the batches, and it is passed to the DAG conf via the API. The batch id combines the creation timestamp with the tenant and filetype. Example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and a DAG is triggered for each filetype (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch id is then combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline:
process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse]
filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse
Here filetype2_exists is a BranchPythonOperator that verifies whether DAG_2 was triggered and, if it was, merges the resulting data from DAG2 with DAG1's before executing write_new_data_to_warehouse.
Based on this DAG model, there will be one DAG run for each tenant. So the DAG can have multiple DAG runs in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs in the same DAG instead of working with dynamic DAGs? With dynamic DAGs, I would end up with process_data_on_spark_tenant1, process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach the hundreds.
Now, consider that filetype2 may or may not be present in the batch, and that I use the model mentioned above (a single DAG with multiple DAG runs in parallel, one per tenant). The only idea I have for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) is to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadata database and check whether there is a dag_run_id that contains the current batch_id and filetype2, and, with a sensor, wait for that DAG run's status to be success. Then I could run the read_data_from_filetype2 task. Otherwise, if no dag_run_id with the batch_id and filetype2 is registered in the Airflow metadata database, I can proceed to write_new_data_to_warehouse directly.
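
The run-id convention described above could be sketched roughly as follows. The helper names are hypothetical, and the list of known run ids is passed in as a plain argument; in a real DAG it would have to come from wherever the DAG runs are looked up, which this sketch does not attempt.

```python
def build_run_id(batch_id, filetype, logical_date_iso):
    """Compose a run id that embeds the batch id and the filetype."""
    return f"manual__{batch_id}_{filetype}__{logical_date_iso}"


def matches_batch(run_id, batch_id, filetype):
    """Check whether a run id belongs to the given batch and filetype."""
    # The double underscores on both sides avoid matching a longer batch id
    # or filetype that merely shares a prefix or suffix.
    return f"__{batch_id}_{filetype}__" in run_id


def choose_branch(known_run_ids, batch_id):
    """Return the task id a branching callable might follow."""
    for run_id in known_run_ids:
        if matches_batch(run_id, batch_id, "filetype2"):
            return "read_data_from_filetype2"
    return "write_new_data_to_warehouse"


if __name__ == "__main__":
    run_id = build_run_id(
        "tenant1_20221120123323", "filetype2", "2022-11-19T00:00:00+00:00"
    )
    print(run_id)
    print(choose_branch([run_id], "tenant1_20221120123323"))
    print(choose_branch([run_id], "tenant2_20221120123323"))
```

Waiting for the matching DAG run to succeed would still need a sensor on top of this; the sketch only covers composing and matching the id.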
Here's the other question:
Is it good practice to modify the dag_run_id and use it together with the Airflow metadata database to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadata database and keep parallel DAG runs in a single DAG?
Or is there a better approach to this problem?
Thank You.

In Airflow, can I create the pool inside the DAG if it does not exist?

I have a DAG that triggers an external DAG using TriggerDagRunOperator. The triggering DAG queries a database and, based on a type ID, triggers the associated external DAG along with the parameters that DAG needs. I would like to pass a pool name as part of these parameters, and I am wondering whether I can create the pool in the (external) DAG if it does not exist.

How do TaskInstances in the same process share variables in airflow

I need to get information from the current process within a running DAG instance.
For example, if I have created a DAG instance [run_id] via the Airflow API, is there a way to get the global variables of this process group and define a method that is aware of each DAG instance's global variables, so that I can get the parameters I want?
If you need cross-communication between tasks, you can use XCom.
Note that XComs are meant for sharing metadata and are limited in size.
Airflow also offers Variables as a key/value store.

How to get an Airflow DAG's execution date using the command line?

To get the dag_state, I run the following CLI command:
airflow dag_state example_bash_operator '12-12T16:04:46.960661+00:00'
The trouble is that I have to explicitly pass the exact date-time (i.e. the execution_date) to this command.
When I run airflow list_dags, I only get a listing of DAGs but not their execution dates.
Is there a way to obtain the exact date-time (i.e. '12-12T16:04:46.960661+00:00') for a given DAG using the CLI?
There's a conceptual issue here. DAGs are objects that have schedules, not execution dates. When the schedule is due, DagRuns are created for that DAG with the appropriate execution_date.
So you can ask for the state of a DagRun using the CLI by providing the execution_date, because execution dates (almost uniquely) map to a specific DagRun. Almost uniquely, because in practice you can trigger two DagRuns with the same execution_date, but that's an unusual scenario.
But if you ask for the execution_date of a DAG, what do you really want to know? The execution_date of the most recently created DagRun? The list of execution_dates for the currently running DagRuns?
You can check the list_dag_runs dag_id CLI command and see if you can filter its output to your needs.

Use Airflow for batch processing to dynamically start multiple tasks based on the output of a parent task

I am trying to figure out whether Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that Airflow can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe or direction that fits this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition and if needed the result can be merged again. For local files, the operator runs the split command. In Kubernetes, this works nicely with the cluster autoscaling.
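
The partitioning idea can be illustrated in plain Python, independent of Airflow. This is a rough sketch, not the custom operator mentioned above: the function names and the round-robin strategy are assumptions, and `task_b` and `task_c` are placeholders for the real per-partition work and the final merge.

```python
def split_into_partitions(items, num_partitions):
    """Distribute items round-robin over a fixed number of partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions


def task_b(partition):
    """Placeholder for the work done on one partition (one Task B instance)."""
    return [name.upper() for name in partition]


def task_c(results):
    """Placeholder for the merge step that runs after all partitions finish."""
    return sorted(item for part in results for item in part)


if __name__ == "__main__":
    files = ["a", "b", "c", "d", "e"]          # stands in for Task A's output
    parts = split_into_partitions(files, 2)    # predetermined partition count
    print(parts)
    print(task_c([task_b(part) for part in parts]))
```

Fixing the number of partitions up front keeps the shape of the DAG static: one downstream task per partition, regardless of how many files the parent task produces.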
