I would like to know how to get the region name, like ap-northeast-1, from a DAG in MWAA.
Is it possible to get it?
We can get the region name from a config file, but we would like to refer to it without a config file.
You could use a custom environment variable, as described here, or use the MWAA ECS environment variables. The following are available on our workers:
AWS_DEFAULT_REGION=eu-west-1
AWS_REGION=eu-west-1
and you can call os.environ["AWS_REGION"] in your DAG.
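For example, a minimal sketch of reading the region inside a DAG (the boto3 fallback is my own assumption, not something MWAA requires):

import os
import boto3

def get_region():
    # MWAA workers expose AWS_REGION / AWS_DEFAULT_REGION in the environment
    region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
    if not region:
        # fall back to the SDK's session region (assumption: boto3 is configured)
        region = boto3.session.Session().region_name
    return region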
Related
I have an Airflow DAG that is triggered externally via the CLI.
I have a requirement to change the order of task execution based on a Boolean parameter that I would be getting from the CLI.
How do I achieve this?
I understand dag_run.conf can only be used in a template field of an operator.
Thanks in advance.
You cannot change task dependencies with a runtime parameter.
However, you can pass a runtime parameter (with dag_run.conf) so that, depending on its value, tasks will be executed or skipped. For that you need to place operators in your workflow that can handle this logic, for example ShortCircuitOperator or BranchPythonOperator.
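A minimal sketch with BranchPythonOperator (recent Airflow 2.x assumed; the DAG id, task ids and the "use_fast_path" conf key are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch(**context):
    # dag_run.conf holds whatever was passed with --conf at trigger time,
    # e.g. airflow dags trigger --conf '{"use_fast_path": true}' branch_on_conf
    use_fast_path = context["dag_run"].conf.get("use_fast_path", False)
    return "fast_task" if use_fast_path else "slow_task"

with DAG("branch_on_conf", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
    fast_task = EmptyOperator(task_id="fast_task")
    slow_task = EmptyOperator(task_id="slow_task")
    branch >> [fast_task, slow_task]

ShortCircuitOperator works similarly: return False from its callable and everything downstream is skipped.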
I have a requirement to get information about the current run from inside a running DAG.
For example, if I have created a DAG run [run_id] via the Airflow API, is there a way to access the variables of that run, and to define a method that is aware of each DAG run's variables so I can get the parameters I want?
If you need cross-communication between tasks, you can use XCom.
Note that XComs are meant for sharing metadata and are limited in size.
Airflow also offers Variables as a key/value store.
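A minimal TaskFlow sketch of both (recent Airflow 2.x assumed; the key names are illustrative):

from datetime import datetime
from airflow.decorators import dag, task
from airflow.models import Variable

@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def xcom_and_variables():

    @task
    def push_value():
        # the return value is stored as an XCom for this task instance
        return {"run_param": "some_value"}

    @task
    def pull_value(payload: dict):
        print(payload["run_param"])                            # pulled via XCom
        print(Variable.get("my_key", default_var="fallback"))  # key/value store

    pull_value(push_value())

xcom_and_variables()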
I have one DAG that tells another DAG what tasks to create, in a specific order.
Dag 1 -> produces a file that has the task order.
This runs every 5 minutes or so to keep the file fresh.
Dag 2 -> runs the tasks.
This runs daily.
How can I pass this data between the two DAGs using Airflow?
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using XComs is that they can only be used while tasks are running, and once the tasks are created in Dag 2 they're set and cannot be changed, correct?
The problem with pushing the file to s3 is that the airflow instance doesn't have permission to pull from s3 due to security reasons decided by a team that I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the 1st DAG? I would recommend the following workflow:
Dag 1 -> Update the task order and store it in a YAML or JSON file inside the Airflow environment.
Dag 2 -> Read the file to create the required tasks and run them daily.
Keep in mind that Airflow constantly re-parses your DAG files to pick up the latest configuration, so no extra step is required.
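A minimal sketch of Dag 2 under that workflow (recent Airflow 2.x assumed), assuming Dag 1 writes a JSON list of task names to a path the scheduler can see; the path and the BashOperator commands are made up:

import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

ORDER_FILE = Path("/opt/airflow/dags/task_order.json")  # hypothetical location

with DAG("dag_2_dynamic", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False) as dag:
    # this code runs every time the scheduler parses the file,
    # so the task list follows whatever Dag 1 last wrote
    task_names = json.loads(ORDER_FILE.read_text()) if ORDER_FILE.exists() else []
    previous = None
    for name in task_names:
        current = BashOperator(task_id=name, bash_command=f"echo running {name}")
        if previous:
            previous >> current  # chain tasks in the order given by the file
        previous = current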
I have had a similar issue in the past and it largely depends on your setup.
If you are running Airflow on Kubernetes this might work.
You create a PV (Persistent Volume) and a PVC (Persistent Volume Claim).
You start your application with a KubernetesPodOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.
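A rough sketch of those steps with KubernetesPodOperator, assuming a PVC named "shared-data" already exists in the namespace (import paths can differ slightly between provider versions):

from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

volume = k8s.V1Volume(
    name="shared-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="shared-data"),
)
volume_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/shared")

with DAG("pvc_handoff", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    producer = KubernetesPodOperator(
        task_id="producer",
        image="python:3.11-slim",
        cmds=["sh", "-c", "echo result > /shared/result.txt"],  # store the result on the PVC
        volumes=[volume],
        volume_mounts=[volume_mount],
    )
    consumer = KubernetesPodOperator(
        task_id="consumer",
        image="python:3.11-slim",
        cmds=["sh", "-c", "cat /shared/result.txt"],            # read it from the other pod
        volumes=[volume],
        volume_mounts=[volume_mount],
    )
    producer >> consumer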
Suppose there is a file that has Spark configuration parameters like the number of executors and the executor memory. In the Airflow workflow, the 1st task will read the file, get the above parameters, and pass them to SSHSparkSubmitOperator operators as executor_memory=<some_variable>, num_executors=<some_variable>.
Is there a way we can accomplish this?
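One hedged sketch: since operator arguments like executor_memory are fixed when the DAG file is parsed, a simple option is to read the file at parse time rather than in a separate first task, and pass the values straight into the operator. SSHSparkSubmitOperator looks like a custom operator, so the stock SparkSubmitOperator stands in for it here; the file path, keys, and application are made up:

import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

conf_path = Path("/opt/airflow/config/spark_job.json")  # hypothetical config file
spark_conf = json.loads(conf_path.read_text()) if conf_path.exists() else {}

with DAG("spark_from_file", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    submit = SparkSubmitOperator(
        task_id="submit_job",
        application="/opt/spark/jobs/my_job.py",                 # hypothetical job
        executor_memory=spark_conf.get("executor_memory", "2g"),
        num_executors=spark_conf.get("num_executors", 2),
    )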
I have a DAG that checks for files on an FTP server (Airflow runs on a separate server). If file(s) exist, the file(s) get moved to S3 (we archive here). From there, the filename is passed to a Spark submit job. The Spark job will process the file via S3 (the Spark cluster is on a different server). I'm not sure if I need multiple DAGs, but here's the flow. What I'm looking to do is to only run a Spark job if a file exists in the S3 bucket.
I tried using an S3 sensor, but it fails/times out once it hits the timeout criteria, and therefore the whole DAG is marked failed.
check_for_ftp_files -> move_files_to_s3 -> submit_job_to_spark -> archive_file_once_done
I only want to run everything after the FTP-check script when a file or files were actually moved into S3.
You can have 2 different DAGs. One only has the S3 sensor and keeps running, let's say, every 5 minutes. If it finds the file, it triggers the second DAG. The second DAG submits the file to Spark and archives it when done. You can use TriggerDagRunOperator in the first DAG to do the triggering.
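A rough sketch of that two-DAG layout (recent Airflow 2.x with the Amazon provider assumed; bucket, key pattern, and DAG ids are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG("watch_s3", start_date=datetime(2023, 1, 1),
         schedule="*/5 * * * *", catchup=False) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-bucket",          # hypothetical bucket
        bucket_key="incoming/*.csv",      # hypothetical key pattern
        wildcard_match=True,
        poke_interval=30,
        timeout=4 * 60,                   # give up before the next scheduled run
    )
    trigger_processing = TriggerDagRunOperator(
        task_id="trigger_processing",
        trigger_dag_id="process_s3_file", # the second DAG
    )
    wait_for_file >> trigger_processing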
The answer Him gave will work.
Another option is using the "soft_fail" parameter that sensors have (it is a parameter from BaseSensorOperator). If you set this parameter to True, then instead of failing, the task will be skipped and all following tasks in the branch will also be skipped.
See the Airflow code for more info.
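A minimal sketch of soft_fail in the original single-DAG flow (bucket and key are made up; the Spark step is just a placeholder):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG("ftp_to_spark", start_date=datetime(2023, 1, 1),
         schedule="@hourly", catchup=False) as dag:
    wait_for_s3_file = S3KeySensor(
        task_id="wait_for_s3_file",
        bucket_name="my-bucket",        # hypothetical bucket
        bucket_key="incoming/*.csv",    # hypothetical key pattern
        wildcard_match=True,
        timeout=30 * 60,
        soft_fail=True,                 # on timeout: skip instead of fail
    )
    submit_job_to_spark = BashOperator(
        task_id="submit_job_to_spark",
        bash_command="echo spark-submit placeholder",
    )
    wait_for_s3_file >> submit_job_to_spark

If the sensor times out it is marked skipped, and submit_job_to_spark (default trigger rule all_success) is skipped too, so the DAG run is not marked failed.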