Suppose there is a file that contains Spark configuration parameters such as the number of executors and the executor memory. In the Airflow workflow, the first task would read the file, get the above parameters, and pass them to the SSHSparkSubmitOperator as executor_memory=<some_variable>, num_executors=<some_variable>.
Is there a way we can accomplish this?
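One common pattern (not from the original thread) is to read the file in a first task, push the values to XCom, and pull them back with Jinja templates in the downstream operator; this only works if the operator exposes those kwargs as template fields. A minimal sketch, assuming an illustrative JSON config file and using BashOperator in place of the question's SSHSparkSubmitOperator:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def read_spark_config(**_):
    # The returned dict is pushed to XCom under the default key "return_value".
    with open("/tmp/spark_config.json") as f:
        return json.load(f)  # e.g. {"executor_memory": "4g", "num_executors": "10"}


with DAG(
    "spark_config_from_file",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # schedule_interval=None on Airflow < 2.4
    catchup=False,
) as dag:
    read_config = PythonOperator(task_id="read_config", python_callable=read_spark_config)

    # If SSHSparkSubmitOperator lists executor_memory / num_executors in its
    # template_fields, the same Jinja expressions can be passed to those kwargs directly.
    submit = BashOperator(
        task_id="submit",
        bash_command=(
            "spark-submit "
            "--executor-memory {{ ti.xcom_pull(task_ids='read_config')['executor_memory'] }} "
            "--num-executors {{ ti.xcom_pull(task_ids='read_config')['num_executors'] }} "
            "app.py"
        ),
    )

    read_config >> submit
```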
I would like to know how to get a region name like ap-northeast-1 from a DAG in MWAA.
Is it possible to get it?
We can get the region name from a config file; however, we would like to refer to it without a config file.
You could use a custom environment variable like here, or use the MWAA ECS environment variables. The following are available on our workers:
AWS_DEFAULT_REGION=eu-west-1
AWS_REGION=eu-west-1
You can then read os.environ["AWS_REGION"] in your DAG.
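For example (a minimal sketch; the fallback default is only illustrative):

```python
import os

# MWAA workers expose these environment variables, per the answer above.
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION", "eu-west-1")
print(region)  # e.g. "eu-west-1"
```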
I have an Airflow DAG that is triggered externally via the CLI.
I have a requirement to change the order of execution of tasks based on a Boolean parameter that I would be getting from the CLI.
How do I achieve this?
I understand dag_run.conf can only be used in a template field of an operator.
Thanks in advance.
You cannot change task dependencies with a runtime parameter.
However, you can pass a runtime parameter (via dag_run.conf) and, depending on its value, have tasks executed or skipped. For that you need to place operators in your workflow that can handle this logic, for example ShortCircuitOperator or BranchPythonOperator (see the sketch below).
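A minimal sketch of the BranchPythonOperator approach, assuming the flag is passed as --conf '{"fast_path": true}' on the CLI; the DAG id and task ids are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow versions
from airflow.operators.python import BranchPythonOperator


def choose_path(**context):
    # Triggered e.g. with: airflow dags trigger branch_on_conf --conf '{"fast_path": true}'
    fast_path = context["dag_run"].conf.get("fast_path", False)
    return "fast_branch" if fast_path else "slow_branch"


with DAG(
    "branch_on_conf",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    fast_branch = EmptyOperator(task_id="fast_branch")
    slow_branch = EmptyOperator(task_id="slow_branch")

    # Only the branch returned by choose_path runs; the other is skipped.
    branch >> [fast_branch, slow_branch]
```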
I have a requirement to get information about the current process from within a running DAG instance.
For example, if I have created a DAG run [run_id] via the Airflow API, is there a way to get the global variables of that process group and to define a method that is aware of the global variables of each DAG run, so that I can get the parameters I want?
If you need cross-communication between tasks, you can use XCom.
Note that XCom is meant for sharing metadata and is limited in size.
Airflow also offers Variables as a key/value store.
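A minimal sketch of both options; the task ids, keys, and values are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def producer(**context):
    # Explicit XCom push; a callable's return value is also pushed under "return_value".
    context["ti"].xcom_push(key="my_param", value={"region": "ap-northeast-1"})


def consumer(**context):
    param = context["ti"].xcom_pull(task_ids="producer", key="my_param")
    # Variables are a global key/value store, independent of any particular DAG run.
    threshold = Variable.get("threshold", default_var="10")
    print(param, threshold)


with DAG("xcom_demo", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    PythonOperator(task_id="producer", python_callable=producer) >> PythonOperator(
        task_id="consumer", python_callable=consumer
    )
```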
I have a function which dynamically creates a sub-task, where I read a value with xcom_pull, but there I am getting the error:
File "/home/airflow/gcs/recon_nik_v6.py", line 168, in create_audit_task
my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict')
KeyError: 'ti'
If I use the same my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict') code in another function it works, but in this dynamic part it does not.
Similarly to your other questions (and as explained in Slack several times), this is not how Airflow works.
XCom pull and task instances are only available while a DAG run is being executed. When you create your DAG structure (i.e. dynamically generate DAGs), you cannot use them.
Only task instances executing tasks can access them, and that happens long after the DAG files have been parsed and the DAG structure established.
So what you are trying to do is simply impossible.
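To illustrate the parse-time vs. run-time distinction, a minimal sketch (the task ids reuse the ones from the question):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def use_xcom_at_runtime(**context):
    # OK: "ti" exists here because this code runs inside an executing DAG run.
    my_dict = context["ti"].xcom_pull(task_ids="accept_input_from_cli", key="my_ip_dict")
    print(my_dict)


with DAG("parse_vs_run_time", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    # NOT possible here: while this file is being parsed there is no task instance,
    # so there is no "ti" to pull XCom from (hence the KeyError above).
    consume = PythonOperator(task_id="consume", python_callable=use_xcom_at_runtime)
```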
I am trying to figure out if Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that it can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe or direction that fits this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition and if needed the result can be merged again. For local files, the operator runs the split command. In Kubernetes, this works nicely with the cluster autoscaling.
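A minimal sketch of that fixed-partition fan-out, with illustrative names and placeholder task bodies (newer Airflow versions also offer dynamic task mapping via expand, which is a different technique from the fixed split described here):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

NUM_PARTITIONS = 4  # predetermined partition count


def split_input(**_):
    # Task A: e.g. run `split` on the input and write partition_0 .. partition_3
    ...


def process_partition(partition_index, **_):
    # Task B: process a single partition
    ...


def merge_results(**_):
    # Task C: combine the per-partition outputs
    ...


with DAG("fixed_fanout", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    split = PythonOperator(task_id="split", python_callable=split_input)
    merge = PythonOperator(task_id="merge", python_callable=merge_results)

    # One replicated downstream task per partition, all fanning back into the merge step.
    for i in range(NUM_PARTITIONS):
        process = PythonOperator(
            task_id=f"process_{i}",
            python_callable=process_partition,
            op_kwargs={"partition_index": i},
        )
        split >> process >> merge
```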