Suppose there is a file that contains Spark configuration parameters such as the number of executors and the executor memory. In the Airflow workflow, the first task would read the file, get the above parameters, and pass them to the SSHSparkSubmitOperator as executor_memory=<some_variable>, num_executors=<some_variable>.
Is there a way we can accomplish this?
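One common pattern (not from the original thread) is to read the file in a first task, push the values to XCom, and pull them back with Jinja templates in the downstream operator; this only works if the operator exposes those kwargs as template fields. A minimal sketch, assuming an illustrative JSON config file and using BashOperator in place of the question's SSHSparkSubmitOperator:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def read_spark_config(**_):
    # The returned dict is pushed to XCom under the default key "return_value".
    with open("/tmp/spark_config.json") as f:
        return json.load(f)  # e.g. {"executor_memory": "4g", "num_executors": "10"}


with DAG(
    "spark_config_from_file",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # schedule_interval=None on Airflow < 2.4
    catchup=False,
) as dag:
    read_config = PythonOperator(task_id="read_config", python_callable=read_spark_config)

    # If SSHSparkSubmitOperator lists executor_memory / num_executors in its
    # template_fields, the same Jinja expressions can be passed to those kwargs directly.
    submit = BashOperator(
        task_id="submit",
        bash_command=(
            "spark-submit "
            "--executor-memory {{ ti.xcom_pull(task_ids='read_config')['executor_memory'] }} "
            "--num-executors {{ ti.xcom_pull(task_ids='read_config')['num_executors'] }} "
            "app.py"
        ),
    )

    read_config >> submit
```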
I would like to know how to get a region name like ap-northeast-1 from a DAG in MWAA.
Is it possible to get it?
We can get the region name from a config file; however, we would like to refer to it without a config file.
You could use a custom environment variable like here, or use the MWAA ECS environment variables. The following are available on our workers:
AWS_DEFAULT_REGION=eu-west-1
AWS_REGION=eu-west-1
You can then read os.environ["AWS_REGION"] in your DAG.
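For example (a minimal sketch; the fallback default is only illustrative):

```python
import os

# MWAA workers expose these environment variables, per the answer above.
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION", "eu-west-1")
print(region)  # e.g. "eu-west-1"
```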
I have an Airflow DAG that is triggered externally via the CLI.
I have a requirement to change the order of execution of tasks based on a Boolean parameter that I would be getting from the CLI.
How do I achieve this?
I understand dag_run.conf can only be used in a template field of an operator.
Thanks in advance.
You cannot change task dependencies with a runtime parameter.
However, you can pass a runtime parameter (via dag_run.conf) and, depending on its value, have tasks executed or skipped. For that you need to place operators in your workflow that can handle this logic, for example ShortCircuitOperator or BranchPythonOperator (see the sketch below).
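A minimal sketch of the BranchPythonOperator approach, assuming the flag is passed as --conf '{"fast_path": true}' on the CLI; the DAG id and task ids are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow versions
from airflow.operators.python import BranchPythonOperator


def choose_path(**context):
    # Triggered e.g. with: airflow dags trigger branch_on_conf --conf '{"fast_path": true}'
    fast_path = context["dag_run"].conf.get("fast_path", False)
    return "fast_branch" if fast_path else "slow_branch"


with DAG(
    "branch_on_conf",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    fast_branch = EmptyOperator(task_id="fast_branch")
    slow_branch = EmptyOperator(task_id="slow_branch")

    # Only the branch returned by choose_path runs; the other is skipped.
    branch >> [fast_branch, slow_branch]
```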
I have a requirement to get information about the current process from within a running DAG instance.
For example, if I have created a DAG run [run_id] via the Airflow API, is there a way to get the global variables of that process group and to define a method that is aware of the global variables of each DAG run, so that I can get the parameters I want?
If you need cross-communication between tasks, you can use XCom.
Note that XCom is meant for sharing metadata and is limited in size.
Airflow also offers Variables as a key/value store.
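A minimal sketch of both options; the task ids, keys, and values are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def producer(**context):
    # Explicit XCom push; a callable's return value is also pushed under "return_value".
    context["ti"].xcom_push(key="my_param", value={"region": "ap-northeast-1"})


def consumer(**context):
    param = context["ti"].xcom_pull(task_ids="producer", key="my_param")
    # Variables are a global key/value store, independent of any particular DAG run.
    threshold = Variable.get("threshold", default_var="10")
    print(param, threshold)


with DAG("xcom_demo", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    PythonOperator(task_id="producer", python_callable=producer) >> PythonOperator(
        task_id="consumer", python_callable=consumer
    )
```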
I have a function which dynamically creates a sub-task, where I read a value with xcom_pull, but there I am getting the error:
File "/home/airflow/gcs/recon_nik_v6.py", line 168, in create_audit_task
my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict')
KeyError: 'ti'
If I use the same my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict') code in another function it works, but in this dynamic part it does not.
Similarly to your other questions (and as explained in Slack several times), this is not how Airflow works.
XCom pull and task instances are only available while a DAG run is being executed. When you create your DAG structure (i.e. dynamically generate DAGs), you cannot use them.
Only task instances executing tasks can access them, and that happens long after the DAG files have been parsed and the DAG structure established.
So what you are trying to do is simply impossible.
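To illustrate the parse-time vs. run-time distinction, a minimal sketch (the task ids reuse the ones from the question):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def use_xcom_at_runtime(**context):
    # OK: "ti" exists here because this code runs inside an executing DAG run.
    my_dict = context["ti"].xcom_pull(task_ids="accept_input_from_cli", key="my_ip_dict")
    print(my_dict)


with DAG("parse_vs_run_time", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    # NOT possible here: while this file is being parsed there is no task instance,
    # so there is no "ti" to pull XCom from (hence the KeyError above).
    consume = PythonOperator(task_id="consume", python_callable=use_xcom_at_runtime)
```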
I am trying to figure out if Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that it can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe or direction that fits this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition and if needed the result can be merged again. For local files, the operator runs the split command. In Kubernetes, this works nicely with the cluster autoscaling.
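A minimal sketch of that fixed-partition fan-out, with illustrative names and placeholder task bodies (newer Airflow versions also offer dynamic task mapping via expand, which is a different technique from the fixed split described here):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

NUM_PARTITIONS = 4  # predetermined partition count


def split_input(**_):
    # Task A: e.g. run `split` on the input and write partition_0 .. partition_3
    ...


def process_partition(partition_index, **_):
    # Task B: process a single partition
    ...


def merge_results(**_):
    # Task C: combine the per-partition outputs
    ...


with DAG("fixed_fanout", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    split = PythonOperator(task_id="split", python_callable=split_input)
    merge = PythonOperator(task_id="merge", python_callable=merge_results)

    # One replicated downstream task per partition, all fanning back into the merge step.
    for i in range(NUM_PARTITIONS):
        process = PythonOperator(
            task_id=f"process_{i}",
            python_callable=process_partition,
            op_kwargs={"partition_index": i},
        )
        split >> process >> merge
```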