I know that the XCom backend can be changed when running stock Apache Airflow on Linux (e.g. Ubuntu), in Docker, and so on. However, I'm asking specifically whether it is possible with MWAA.
Is it possible to set up Amazon Managed Workflows for Apache Airflow (MWAA) to use S3 as the XCom backend? If it is possible, how can I do it?
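For context, in open-source Airflow a custom XCom backend is a subclass of airflow.models.xcom.BaseXCom that overrides serialize_value and deserialize_value, wired up via xcom_backend under [core] in airflow.cfg. The sketch below illustrates that contract using a local directory as a stand-in for an S3 bucket, so it runs without Airflow or AWS; a real backend would subclass BaseXCom and upload with boto3, and all names here are illustrative.

```python
import json
import os
import tempfile
import uuid


class FakeS3XComBackend:
    """Illustrates the custom-XCom-backend contract: store the real
    payload externally and keep only a small reference string in the
    Airflow metadata database. A real backend would subclass
    airflow.models.xcom.BaseXCom and upload to S3 with boto3; the
    local directory here is just a stand-in so the sketch runs."""

    PREFIX = "xcom-s3://"  # illustrative reference scheme

    def __init__(self, root=None):
        self.root = root or tempfile.mkdtemp()

    def serialize_value(self, value):
        # "Upload" the payload, return only a reference to it.
        key = f"{uuid.uuid4()}.json"
        with open(os.path.join(self.root, key), "w") as f:
            json.dump(value, f)
        return self.PREFIX + key

    def deserialize_value(self, reference):
        # Resolve the reference back into the original payload.
        key = reference[len(self.PREFIX):]
        with open(os.path.join(self.root, key)) as f:
            return json.load(f)
```

On stock Airflow you would then point [core] xcom_backend at your backend class; whether MWAA exposes that setting is exactly what this question is asking.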
Can someone provide a YAML file for the setup described below? I need it for a project.
I am trying to execute my tasks in parallel on each core of the workers, as this gives the tasks a performance boost.
To achieve this I want to execute my Airflow tasks directly on a Dask cluster. My project requires Airflow to run in Docker, but I couldn't find any docker-compose.yaml file for Airflow with the DaskExecutor.
Dask generally has a scheduler and some workers in its cluster.
Apart from this, I've tried to achieve this task parallelism with the airflow-provider-ray library from the Astronomer registry. I followed this documentation to set it up in Docker, but I am facing OSError: Connection timeout. Here Airflow is running in Docker and the Ray cluster in my local Python environment.
Secondly, I've tried the same with a Dask cluster. In this setup, Airflow runs in Docker with the CeleryExecutor, and in another Docker setup there is a Dask scheduler, two workers, and a notebook. I am able to connect the two, but I keep getting the error ModuleNotFoundError: No module named 'unusual_prefix_2774d32fcb40c2ba2f509980b973518ede2ad0c3_dask_client'.
A solution to either of these problems would be appreciated.
I'm using airflow_docker, and it seems it does not come with an S3KeySensor. It does come with a Sensor class, though. How can I create my own custom S3KeySensor? Do I just have to inherit from Sensor and override the poke method? Can I literally just copy the source code from the S3KeySensor? Here's the source code for the S3KeySensor.
The reason I am using airflow_docker is that each task runs in its own container, so I can pass AWS role credentials to the task container and give it the exact permissions it needs for an action, rather than relying on the worker container's role permissions.
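For reference, the inherit-and-override-poke approach the question describes could be sketched like this. The Sensor base class and the in-memory client below are stand-ins I made up so the example is self-contained and runnable; a real version would inherit from airflow_docker's actual Sensor class, take a boto3 S3 client, and mirror the upstream S3KeySensor's constructor arguments.

```python
import fnmatch


class Sensor:
    """Stand-in for airflow_docker's Sensor base class (assumption:
    subclasses implement poke(), returning True when the condition
    is met)."""

    def poke(self):
        raise NotImplementedError


class CustomS3KeySensor(Sensor):
    """Pokes until a key (optionally a wildcard pattern) exists in a
    bucket, mirroring the logic of the upstream S3KeySensor."""

    def __init__(self, s3_client, bucket, key, wildcard_match=False):
        self.s3 = s3_client  # a boto3-shaped S3 client
        self.bucket = bucket
        self.key = key
        self.wildcard_match = wildcard_match

    def poke(self):
        if self.wildcard_match:
            # List keys under the prefix before the first wildcard
            # and match them against the full pattern.
            prefix = self.key.split("*", 1)[0]
            resp = self.s3.list_objects_v2(Bucket=self.bucket, Prefix=prefix)
            keys = (obj["Key"] for obj in resp.get("Contents", []))
            return any(fnmatch.fnmatch(k, self.key) for k in keys)
        try:
            self.s3.head_object(Bucket=self.bucket, Key=self.key)
            return True
        except Exception:
            # Real code should catch botocore's ClientError and
            # re-raise anything that is not a 404.
            return False


class InMemoryS3:
    """Tiny fake S3 client so the sketch runs without AWS."""

    def __init__(self, keys):
        self.keys = set(keys)

    def head_object(self, Bucket, Key):
        if Key not in self.keys:
            raise KeyError(Key)
        return {}

    def list_objects_v2(self, Bucket, Prefix):
        matches = [k for k in self.keys if k.startswith(Prefix)]
        return {"Contents": [{"Key": k} for k in matches]}
```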
I would recommend upgrading to the latest version of Airflow, or at least starting at Airflow 2.0.
In Airflow 2.0, a majority of the operators, sensors, and hooks that are not considered core functionality are categorized and separated into providers. Amazon has a provider, apache-airflow-providers-amazon, that includes the S3KeySensor.
You can also install backport providers so the one you would need is apache-airflow-backport-providers-amazon if you want to stay on Airflow 1.10.5 (based on the links you sent).
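Concretely, a sketch of the install and import (hedged: the sensor's module path has moved in later provider releases, so check the version you end up installing):

```python
# Airflow >= 2.0:
#   pip install apache-airflow-providers-amazon
#   from airflow.providers.amazon.aws.sensors.s3_key import S3KeySensor
#
# Airflow 1.10.x, staying on 1.x (backport package; same import path
# as above once installed):
#   pip install apache-airflow-backport-providers-amazon
```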
I have previously used Airflow running on an Ubuntu VM. I was able to log into the VM via WinSCP and PuTTY to run commands and edit Airflow-related files as required.
But this is the first time I've come across Airflow running on an AWS ECS cluster. I am new to ECS, so I am trying to find the best possible way to:
run commands like "airflow db init", stop/start the web server and scheduler, etc.
install new Python packages like "pip install "
view and edit Airflow files in the ECS cluster
I was reading about the AWS CLI and the ECS CLI; could they be helpful? Or is there any other way that lets me do the actions mentioned above?
Thanks in Advance.
Kind Regards
P.
There are many articles that describe how to run Airflow on ECS:
https://tech.gc.com/apache-airflow-on-aws-ecs/
https://towardsdatascience.com/building-an-etl-pipeline-with-airflow-and-ecs-4f68b7aa3b5b
https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
many more
[Note: Fargate allows you to run containers via ECS without needing to run EC2 instances yourself. More here if you want additional background on what Fargate is.]
Also, the AWS CLI is a generic CLI that maps to all AWS APIs (mostly 1:1). You may consider it for your deployment, but it should not be your starting point.
The ECS CLI is a more abstracted CLI that exposes higher level constructs and workflows that are specific to ECS. Note that the ECS CLI has been superseded by AWS Copilot which is an opinionated CLI to deploy workloads patterns on ECS. It's perhaps too opinionated to be able to deploy Airflow.
My suggestion is to go through the blogs above and get inspired.
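On the original question of running one-off commands such as "airflow db init" inside a running container: ECS Exec is the usual mechanism (aws ecs execute-command in the AWS CLI, the ExecuteCommand API in boto3), provided the task was launched with it enabled. A hedged sketch follows; the helper only assembles the parameters for boto3's ecs.execute_command call, and the cluster and container names are placeholders:

```python
def ecs_exec_params(cluster, task_arn, container, command):
    """Build the keyword arguments for boto3's ecs.execute_command.

    Assumes ECS Exec is enabled on the task (enableExecuteCommand=True
    at run time) and the task role permits the SSM session it opens;
    otherwise the API call is rejected.
    """
    return {
        "cluster": cluster,
        "task": task_arn,
        "container": container,
        "interactive": True,
        "command": command,
    }


# Usage (not executed here; requires boto3 and a running task):
#   import boto3
#   ecs = boto3.client("ecs")
#   ecs.execute_command(**ecs_exec_params(
#       "my-airflow-cluster", task_arn, "webserver", "airflow db init"))
```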
At this time I use the LocalExecutor with Airflow. My DAG uses Docker images via the DockerOperator that ships with Airflow. For this, the Docker images must be present on the PC. If I want to use a distributed executor like the CeleryExecutor or KubernetesExecutor, must the Docker images be present on all the machines that are part of the Celery or Kubernetes cluster?
Regards
Oli
That is correct. Since Airflow workers run tasks locally, you will need the Docker images (or any other resources) available locally on each worker. You could try this link to set up a local Docker registry, which can serve the Docker images and save you the effort of maintaining them on each machine manually.
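One way to use such a registry is to push each image once and reference it by its registry-qualified name in the DockerOperator, so every worker pulls from the same place. A small illustrative helper (the registry host is a placeholder; the heuristic mirrors Docker's rule that the first path segment is a registry host only if it contains a dot or colon, or is "localhost"):

```python
def registry_image(image, registry="registry.local:5000"):
    """Qualify a bare image name with a private registry host, e.g.
    'my-etl:1.2' -> 'registry.local:5000/my-etl:1.2'. Names already
    qualified with a registry host are returned unchanged."""
    if "/" in image:
        host = image.split("/", 1)[0]
        if "." in host or ":" in host or host == "localhost":
            return image  # already registry-qualified
    return f"{registry}/{image}"
```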
In the Airflow scheduler, there are settings such as the heartbeat interval and max_threads.
See How to reduce airflow dag scheduling latency in production?.
If I am using Google Cloud Composer, do I have to worry/set these values?
If not, what are the values that Google Cloud Composer uses?
You can see the Airflow config in the Composer instance bucket at gs://composer_instance_bucket/airflow.cfg. You can tune this configuration as you wish, keeping in mind that Cloud Composer blocks some configuration options.
Also, if you go to Admin -> Configuration in the Airflow UI, you can see the full configuration.
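Once you've copied the file out of the bucket (for example with gsutil cp gs://composer_instance_bucket/airflow.cfg .), it is a standard INI file, so the scheduler settings in question can be inspected with configparser. The values below are purely illustrative, not Composer's actual defaults:

```python
import configparser

# Stand-in for the airflow.cfg copied out of the Composer bucket;
# the values are illustrative, not Composer's defaults.
sample_cfg = """
[scheduler]
scheduler_heartbeat_sec = 5
max_threads = 2
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample_cfg)

heartbeat = cfg.getint("scheduler", "scheduler_heartbeat_sec")
max_threads = cfg.getint("scheduler", "max_threads")
```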
If you'd like more control over and visibility into these variables, consider hosted Airflow at Astronomer (https://www.astronomer.io/cloud/), as it runs vanilla Airflow.