Terraform for AWS EMR autoscale - emr

Has anyone used Terraform to provision an AWS EMR cluster with auto-scaling task nodes?
If yes, please share your experience.
Thanks.

Currently, it's not implemented. There are several open issues on GitHub, and the first step towards EMR autoscaling has been made: you can now specify the autoscaling IAM role for the EMR cluster (https://www.terraform.io/docs/providers/aws/r/emr_cluster.html#autoscaling_role).
Until then, you must manage the autoscaling policies yourself with the AWS CLI commands aws emr put-auto-scaling-policy and aws emr remove-auto-scaling-policy.
Issue requesting feature for actual EMR autoscaling: https://github.com/terraform-providers/terraform-provider-aws/issues/713
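As a rough sketch of what managing the policy by hand could look like, the snippet below builds an automatic scaling policy document and prints it as JSON. The rule name, thresholds, and capacity limits are illustrative assumptions, not values from the answer; adjust them to your workload.

```python
import json


def build_scaling_policy(min_nodes, max_nodes):
    """Return an EMR automatic scaling policy as a plain dict."""
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                # Hypothetical rule: name and numbers are placeholders.
                "Name": "scale-out-on-low-yarn-memory",
                "Description": "Add one task node when available YARN memory is low",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }


# Save the printed JSON as policy.json, then attach it with:
#   aws emr put-auto-scaling-policy --cluster-id <cluster-id> \
#       --instance-group-id <instance-group-id> \
#       --auto-scaling-policy file://policy.json
print(json.dumps(build_scaling_policy(1, 10), indent=2))
```

The cluster and instance group IDs are left as placeholders; they would come from the cluster that Terraform created.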

Related

Set up Airflow production environment

I'm a newbie at using Airflow. I went through many Airflow tutorials, and I can say that all are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on that cluster?
I went through many tutorials on the internet.
Airflow has 4 main components:
Webserver: a stateless service that exposes the Airflow UI and REST API
Scheduler: a stateless service that processes the DAGs and runs them; it is the main component
Worker: a stateless service that executes the tasks
Metadata: the Airflow database, where the state is stored; it handles the communication between the 3 other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs the tasks itself by spawning a process for each task, and it works on a single host -> not suitable for your needs
CeleryExecutor: the most widely used executor; you can create one or more schedulers (for HA) and a group of Celery workers to run the tasks, and you can scale it across different nodes
DaskExecutor: similar to CeleryExecutor, but it uses Dask instead of Celery; it is not widely used, and there are not many resources about it
KubernetesExecutor: it runs each task in a K8S pod, and since it's based on Kubernetes, it's very scalable, but it has some drawbacks.
For your use case, I recommend using CeleryExecutor.
If you can use EKS instead of EC2, you can use the Helm chart to install and configure the cluster. If not, you have other options:
run the services directly on the host:
# quote the extra so that shells such as zsh do not expand the brackets
pip install "apache-airflow[celery]"
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers, and webservers to run and distribute them across the 3 nodes, for example: node1 (1 scheduler, 1 webserver, 1 worker), node2 (1 scheduler, 2 workers), node3 (1 webserver, 2 workers). You also need a database: you can use PostgreSQL on AWS RDS, or host it on one of the nodes (not recommended).
using Docker: same as the first solution, but you run containers instead of running the services directly on the host
using Docker Swarm: you can join the 3 nodes into a swarm cluster and manage the configuration from one of the nodes. This gives you some features that the first 2 solutions do not provide, and it is similar to K8S. (doc)
For all 3 solutions, you need to create an airflow.cfg file that contains the configuration and the DB credentials, and you should set the executor option to CeleryExecutor.
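To make the airflow.cfg step more concrete, here is a minimal sketch that renders the relevant settings with Python's configparser. The host names, credentials, and the Redis broker are assumptions for illustration (Celery needs a message broker, which the answer does not name). The [database] section applies to recent Airflow 2.x releases; older versions keep sql_alchemy_conn under [core].

```python
import configparser
import io


def render_airflow_cfg(db_url, broker_url, result_backend):
    """Render the CeleryExecutor-related part of airflow.cfg as text."""
    # interpolation=None keeps '%' characters in passwords untouched.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["core"] = {"executor": "CeleryExecutor"}
    cfg["database"] = {"sql_alchemy_conn": db_url}
    cfg["celery"] = {
        "broker_url": broker_url,
        "result_backend": result_backend,
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Hypothetical endpoints: replace with your RDS instance and broker.
print(
    render_airflow_cfg(
        db_url="postgresql://airflow:CHANGE_ME@my-rds-host:5432/airflow",
        broker_url="redis://node1:6379/0",
        result_backend="db+postgresql://airflow:CHANGE_ME@my-rds-host:5432/airflow",
    )
)
```

The same file would be placed on all 3 nodes so that every scheduler, webserver, and worker points at the same database and broker.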

How can I add EFS to an Airflow deployment on Amazon EKS?

Kubernetes and EKS newbie here.
I've set up an Elastic Kubernetes Service (EKS) cluster and added an Airflow deployment on top of it using the official Helm chart for Apache Airflow. I configured git-sync and can successfully run my DAGs. For some of the DAGs, I need to save data to Amazon EFS. I installed the Amazon EFS CSI driver on EKS following the instructions in the Amazon documentation.
Now I can create a new pod with access to the NFS, but the Airflow deployment broke and stays in the state Back-off restarting failed container. I fetched the events with kubectl -n airflow get events --sort-by='{.lastTimestamp}' and got the following messages:
TYPE     REASON              OBJECT                                           MESSAGE
Warning  BackOff             pod/airflow-scheduler-599fc856dc-c4pgz           Back-off restarting failed container
Normal   FailedBinding       persistentvolumeclaim/redis-db-airflow-redis-0   no persistent volumes available for this claim and no storage class is set
Warning  ProvisioningFailed  persistentvolumeclaim/ebs-claim                  storageclass.storage.k8s.io "ebs-sc" not found
Normal   FailedBinding       persistentvolumeclaim/data-airflow-postgresql-0  no persistent volumes available for this claim and no storage class is set
I have tried this on EKS version 1.22.
From this I understand that Airflow expects an EBS volume for its pods, but the NFS driver changed the configuration of the PVs.
Before I installed the driver, the PVs looked like this:
NAME        CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS  CLAIM                              STORAGECLASS  REASON  AGE
pvc-######  100Gi     RWO           Delete          Bound   airflow/logs-airflow-worker-0      gp2                   1d
pvc-######  8Gi       RWO           Delete          Bound   airflow/data-airflow-postgresql-0  gp2                   1d
pvc-######  1Gi       RWO           Delete          Bound   airflow/redis-db-airflow-redis-0   gp2                   1d
After I installed the EFS CSI driver, the PVs changed:
NAME     CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS  CLAIM              STORAGECLASS  REASON  AGE
efs-pvc  5Gi       RWX           Retain          Bound   efs-storage-claim  efs-sc                2d
I have tried deploying Airflow both before and after installing the EFS driver, and in both cases I get the same error.
How can I get access to the NFS from within Airflow without breaking the Airflow deployment on EKS? Any help would be appreciated.
As the errors above state (no persistent volumes available for this claim and no storage class is set, and storageclass.storage.k8s.io "ebs-sc" not found), you have to deploy a storage class called efs-sc that uses the EFS CSI driver as its provisioner.
Further documentation can be found here
An example of creating the missing storage class and persistent volume can be found here
These steps are also described in the AWS EKS user guide
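As an illustration of the storage class and persistent volume mentioned above, the sketch below builds both manifests and prints them as a JSON list, which kubectl apply -f accepts in the same way as YAML. The PV name efs-pv and the file system ID are placeholders; the capacity, access mode, and reclaim policy mirror the efs-sc volume shown in the question. Treat it as a starting point under those assumptions rather than the exact manifests from the linked examples.

```python
import json


def build_efs_manifests(file_system_id):
    """Return a v1 List holding an EFS StorageClass and a static PersistentVolume."""
    storage_class = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "efs-sc"},
        "provisioner": "efs.csi.aws.com",
    }
    persistent_volume = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": "efs-pv"},  # hypothetical name
        "spec": {
            "capacity": {"storage": "5Gi"},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": "efs-sc",
            "csi": {
                "driver": "efs.csi.aws.com",
                "volumeHandle": file_system_id,
            },
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [storage_class, persistent_volume],
    }


# "fs-REPLACE_ME" is a placeholder for your EFS file system ID.
print(json.dumps(build_efs_manifests("fs-REPLACE_ME"), indent=2))
```

The output could be saved to a file and applied with kubectl apply -f, after which a claim that requests the efs-sc class can bind to the volume.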

Airflow stored in the cloud?

I would like to know if I can make the Airflow UI accessible, like a regular web page, to everyone who has a user account. For this, I would have to host it on a server, right? Which server do you recommend? I was looking around and saw that some people use Amazon EC2.
If your goal is just to make the Airflow UI publicly visible, there are many solutions; you could even do it from your local computer (although that is not a good idea).
Before choosing the cloud provider and the service, you need to think about the requirements:
Does your team have the skills and the time to manage the server? If not, you need a managed service like GCP Cloud Composer or AWS MWAA.
Which executor do you want to use? KubernetesExecutor? CeleryExecutor on K8S? If so, you need a K8S service and not just a VM.
Do you have a heavy load? Do you need HA mode? What about scalability?
After defining the requirements, you can choose between the options:
A small server with LocalExecutor or CeleryExecutor on a VM -> AWS EC2 with a static IP and Route 53 for the DNS name
A scalable server in HA mode on a K8S cluster -> AWS EKS or Google GKE
A managed service, so you can focus only on the development part -> Google Cloud Composer

Sharing resources between two independent OpenStack cloud setups

Is there any possibility to share resources from one OpenStack cloud with another, similar one that has different resource pools? Thanks in advance.
You can try CloudFerry: GitHub link
CloudFerry is a tool for migrating resources and workloads between two OpenStack clouds.
Another tool is stack2stack: GitHub link
stack2stack is a simple Python script to aid data migration from one OpenStack cloud to another through use of the APIs. It aims to cleanly migrate the data, keeping as much in sync as possible, up to the limitations of the OpenStack APIs themselves. Currently the script migrates:
Keystone Users
Keystone Tenants
Keystone roles
Keystone Tenant Memberships
Glance Images
Networks from nova networking to neutron
Security groups from nova networking to neutron

How to mount an EBS volume in Cloudify after the creation of a VM

I want to share some data with my VMs through a mounted EBS volume.
How can I tell Cloudify that every created VM should have an additional mounted EBS volume?
(I'm talking about EBS in the case of Amazon EC2, but I want to do the same with OpenStack and other IaaS providers.)
For EC2, you would need to set the template options in the template section of the cloud configuration file as follows:
options ([
    "securityGroups" : ["default"] as String[],
    "keyPair" : "XXXXX",
    "blockDeviceMappings" : [
        new org.jclouds.ec2.domain.BlockDeviceMapping.MapEBSSnapshotToDevice("/dev/sda1/", "aa", 20, true)
    ]
])
Cloudify uses the jclouds multi-cloud library to handle API calls to Amazon services. For more details on using EBS with EC2, see:
http://demobox.github.com/jclouds-maven-site-1.4.0/1.4.0/jclouds-multi/apidocs/org/jclouds/ec2/domain/BlockDeviceMapping.MapEBSSnapshotToDevice.html
http://demobox.github.com/jclouds-maven-site-1.4.0/1.4.0/jclouds-multi/apidocs/org/jclouds/ec2/domain/BlockDeviceMapping.MapNewVolumeToDevice.html
Please note that these settings are specific to EC2 and are not portable across clouds.
With regard to OpenStack, the Cloudify OpenStack cloud driver does not currently support volumes, the OpenStack equivalent of EBS. This is accurate for versions 2.1.1 and 2.2 of Cloudify, though this feature is expected to become available in the near future.

Resources