Glue operator ignores "iam_role_name" parameter - airflow

I am trying to have a DAG trigger a Glue job. I have a role that can run these Glue jobs, and I want the Glue task to assume this role. The operator has a parameter called iam_role_name, but when I pass the name of my Glue role as this parameter, it still tries to execute as the Airflow role rather than assuming the role I gave.
I contacted AWS support and they said that basically my options are:
Create a connection tied to the Glue role, and use that connection for the operator
Add the Glue permissions to the Airflow role
Use their own operator: https://github.com/aws-samples/mwaa-rbac-task
It seems to me that the developers of AwsGlueJobHook intended the iam_role_name parameter to be the easy way to assume roles. Yet the parameter seems to have no effect in practice. Why is this?
I am using:
Airflow 2.2.2 via AWS MWAA
apache-airflow-providers-amazon 2.4.0

In AwsGlueJobHook, the iam_role_name is used in this method to get the role information, and from these information, it uses only the role arn as an argument for the method create_job which create you job.
But in fact, the role/user used to create the client is the one found in the airflow connection.

Related

Which Aurora table store DAG variable information?

I have an airflow DAG which call a particular bash command using a variable. At the backend, we have Aurora DB. Do we know if there are any tables in the Aurora DB which stores information of the variables used in Airflow DAGs? I need to create a report out of it and hence, the ask to access the variables from backend.
I tried using the operational_insights schema but could not find any tables with the desired information.
If you are using an Airflow variable you should be able to query a list of them with the REST API no matter which backend you use.
curl "http://<your Airflow host>/api/v1/variables" --user "login:password"
This is preferred over querying the Airflow metadata database directly because if you accidentally modify or drop a table you can corrupt your Airflow.
With that caveat: the standard table where Airflow variables are stored is variable so after logging into the db SELECT * FROM variable; should return a list.
Again this is for Airflow Variables. From your question I am not entirely sure if you mean that or in general any variables that tasks use. In the latter case you might be looking for the rendered_fields parameter of the task instances, which can also be done using the API.

Provide aws credentials to Airflow GreatExpectationsOperator

I would like to use GreatExpectationsOperator to perform data quality validations.
The validation results data should be stored in S3.
I don't see an option to send an airflow connection name to the GE operator, and the AWS credentials in my organization are stored in an airflow connection.
How can great expectations retrieve s3 credentials from airflow connection? and not from the default aws credentials in .aws dir?
Thanks!
We ended up creating a new oprator that inherit from GE operator and the operator get the connection as part of its ecxeute method.

Airflow Metadata DB = airflow_db?

I have a project requirement to back-up Airflow Metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions some connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow Metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in Airflow Web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates airflow_db Conn Id with MySQL by default (see source code)
Default connections are not really useful in production system. It's just a long list of stuff that you are probably not going to use.
Airflow 1.1.10 introduced the ability not to create the default list by setting:
load_default_connections = False in airflow.cfg (See PR)
To give more background the connection list is where hooks find the information needed in order to connect to a service. It's not related to the backend database. Though the backend is db like any db and if you wish to allow hooks to interact with it you can define it in the list like any other connection (which is probably why you have this as option in the default).

Airflow logs in s3 bucket

Would like to write the airflow logs to s3. Following are the parameter that we need to set according to the doc-
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to write/read to s3 if correct permission are set on IAM role attached to the instance?
Fair point, but I think it allows for more flexibility if Airflow is not running on AWS or if you want to use a specific set of credentials rather than give the entire instance access. It might have also been easier implementation as well because the underlying code for writing logs into S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection id.

Cloudera Post deployment config updates

In cloudera is there a way to update list of configurations at a time using CM-API or CURL?
Currently I am updating one by one one using below CM API.
services_api_instance.update_service_config()
How can we update all configurations stored in json/config file at a time.
The CM API endpoint you're looking for is PUT /cm/deployment. From the CM API documentation:
Apply the supplied deployment description to the system. This will create the clusters, services, hosts and other objects specified in the argument. This call does not allow for any merge conflicts. If an entity already exists in the system, this call will fail. You can request, however, that all entities in the system are deleted before instantiating the new ones.
This basically allows you to configure all your services with one call rather than doing them one at a time.
If you are using services that require a database (Hive, Hue, Oozie ...) then make sure you set them up before you call the API. It expects all the parameters you pass in to work so external dependencies must be resolved first.

Resources