I have an Airflow DAG that starts an AWS EMR cluster to run steps. On the DAG we pass some variables that are set on Airflow Variables. But some of these variables are encrypted at Airflow, but when passing to EMR, we can see then clearly at EMR console. Is there any way to hide this?
Here is how we are defining the step. The airflow variable db_pass must be encrypted or hidden somehow
{
"Name": "EMR JOB",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"{{var.value.job_script}}",
"--database_user={{var.value.db_user}}",
"--database_pass={{var.value.db_pass}}"
]
}
}
]
This SAMPLE_STEP_DEFINITION is then passed as to the EmrAddStepsOperator:
...
sample_task = EmrAddStepsOperator(
steps=SAMPLE_STEP_DEFINITION,
...
There are several ways to do this. First I would suggest to encrypt passwords with KMS. Here is the code how to do it:
def encryptString(plainText: String, keyArn: String): String = {
val req = new EncryptRequest().withKeyId(keyArn).withPlaintext(ByteBuffer.wrap(plainText.getBytes))
Base64.getEncoder.encodeToString(kmsClient.encrypt(req).getCiphertextBlob.array())
}
def decryptString(encryptedText: String, keyArn: String): String = {
val req = new DecryptRequest().withCiphertextBlob(ByteBuffer.wrap(Base64.getDecoder.decode(encryptedText)))
new String(kmsClient.decrypt(req).getPlaintext.array())
}
Your just need to attach decrypt permission to EMR_EC2_DefaultRole.
Another way is to pass a config file stored on S3 with password.
Related
Airflow version :2.0.2
Trying to create Emr cluster, by retrying data from AWS secrets manager.
I am trying to write an airflow dag and, my task is to get data from this get_secret function and use it in Spark_steps
def get_secret():
secret_name = Variable.get("secret_name")
region_name = Variable.get(region_name)
# Create a Secrets Manager client
session = boto3.session.Session()
client = session.client(service_name='secretsmanager', region_name=region_name)
account_id = boto3.client('sts').get_caller_identity().get('Account')
try:
get_secret_value_response = client.get_secret_value(SecretId=secret_name)
if 'SecretString' in get_secret_value_response:
secret_str = get_secret_value_response['SecretString']
secret=json.loads(secret_str)
airflow_path=secret["airflow_path"]
return airflow_path
...
I need to use "airflow_path" return value in below spark_steps
Spark_steps:
SPARK_STEPS = [
{
'Name': 'Spark-Submit Command',
"ActionOnFailure": "CONTINUE",
'HadoopJarStep': {
"Jar": "command-runner.jar",
"Args": [
'spark-submit',
'--py-files',
's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
's3://'+airflow_path+'-pyspark/pitchbook/main.py'
],
},
},
I saw on the internet I need to use Xcom, is this right ?, and do I need to run this function in python operator first and then get the value. please provide an example as I am a newbie.
Thanks for your help.
Xi
Yes if you would like to pass dynamic stuff, leveraging xcom push/pull might be easier.
Leverage PythonOperator to push data into xcom.
See reference implementation:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
I'm using Airflow in Google Composer. By default all the tasks in a DAG use the default connection to communicate with Storage, BigQuery, etc. Obviously we can specify another connection configured in Airflow, ie:
task_custom = bigquery.BigQueryInsertJobOperator(
task_id='task_custom_connection',
gcp_conn_id='my_gcp_connection',
configuration={
"query": {
"query": 'SELECT 1',
"useLegacySql": False
}
}
)
Is it possible to use a specific connection as the default for all tasks in the entire DAG?
Thanks in advance.
UPDATE:
Specify gcp_conn_id via default_args in DAG (as Javier Lopez Tomas recommended) doesn't work completely. The Operators that expect gcp_conn_id as parameter works fine, but in my case unfortunately most of interactions with GCP components do so via clients or hooks within PythonOperators.
For example: If I call DataflowHook (inside a function called by a PythonOperator) without specifying the connection, it internally uses "google_cloud_default" and not "gcp_conn_id" :(
def _dummy_func(**context):
df_hook = DataflowHook()
default_args = {
'gcp_conn_id': 'my_gcp_connection'
}
with DAG(default_args=default_args) as dag:
dummy = PythonOperator(python_callable=_dummy_func)
You can use default args:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#default-arguments
In your case it would be:
default_args = {
"gcp_conn_id": "my_gcp_connection"
}
with DAG(blabla)...
I'm learning Airflow and am planning to set some variables to use across different tasks. These are in my dags folder, saved as configs.json, like so:
{
"vars": {
"task1_args": {
"something": "This is task 1"
},
"task2_args": {
"something": "this is task 2"
}
}
}
I get that we can enter Admin-->Variables--> upload the file. But I have 2 questions:
What if I want to adjust some of the variables while airflow is running? I can adjust my code easily and it updates in realtime but it doesn't seem like this works for variables.
Is there a way to just auto-import this specific file on startup? I don't want to have to add it every time I'm testing my project.
I don't see this mentioned in the docs but it seems like a pretty trivial thing to want.
What you are looking for is With code, how do you update an airflow variable?
Here's an untested snippet that should help
from airflow.models import Variable
Variable.set(key="my_key", value="my_value")
So basically you can write a bootstrap python script to do this setup for you.
In our team, we use such scripts to setup all Connections, and Pools too
In case you are wondering, here's the set(..) method from source
#classmethod
#provide_session
def set(
cls,
key: str,
value: Any,
serialize_json: bool = False,
session: Session = None
):
"""
Sets a value for an Airflow Variable with a given Key
:param key: Variable Key
:param value: Value to set for the Variable
:param serialize_json: Serialize the value to a JSON string
:param session: SQL Alchemy Sessions
"""
if serialize_json:
stored_value = json.dumps(value, indent=2)
else:
stored_value = str(value)
Variable.delete(key, session=session)
session.add(Variable(key=key, val=stored_value))
session.flush()
Is it possible to pre-define variables, connections etc. in a file so that they are loaded when Airflow starts up? Setting them through the UI is not great from a deployment perspective.
Cheers
Terry
I'm glad that someone asked this question. In fact since Airflow completely exposes the underlying SQLAlchemy models to end-user, programmatic manipulation (creation, updation & deletion) of all Airflow models, particularly those used to supply configs like Connection & Variable is possible.
It may not be very obvious, but the open-source nature of Airflow means there are no secrets: you just need to peek in harder. Particularly for these use-cases, I've always found the cli.py to be very useful reference point.
So here's the snippet I use to create all MySQL connections while setting up Airflow. The input file supplied is of JSON format with the given structure.
# all imports
import json
from typing import List, Dict, Any, Optional
from airflow.models import Connection
from airflow.settings import Session
from airflow.utils.db import provide_session
from sqlalchemy.orm import exc
# trigger method
def create_mysql_conns(file_path: str) -> None:
"""
Reads MySQL connection settings from a given JSON file and
persists it in Airflow's meta-db. If connection for same
db already exists, it is overwritten
:param file_path: Path to JSON file containing MySQL connection settings
:type file_path: str
:return: None
:type: None
"""
with open(file_path) as json_file:
json_data: List[Dict[str, Any]] = json.load(json_file)
for settings_dict in json_data:
db_name: str = settings_dict["db"]
conn_id: str = "mysql.{db_name}".format(db_name=db_name)
mysql_conn: Connection = Connection(conn_id=conn_id,
conn_type="mysql",
host=settings_dict["host"],
login=settings_dict["user"],
password=settings_dict["password"],
schema=db_name,
port=settings_dict.get("port", mysql_conn_description["port"]))
create_and_overwrite_conn(conn=mysql_conn)
# utility delete method
#provide_session
def delete_conn_if_exists(conn_id: str, session: Optional[Session] = None) -> bool:
# Code snippet borrowed from airflow.bin.cli.connections(..)
try:
to_delete: Connection = (session
.query(Connection)
.filter(Connection.conn_id == conn_id)
.one())
except exc.NoResultFound:
return False
except exc.MultipleResultsFound:
return False
else:
session.delete(to_delete)
session.commit()
return True
# utility overwrite method
#provide_session
def create_and_overwrite_conn(conn: Connection, session: Optional[Session] = None) -> None:
delete_conn_if_exists(conn_id=conn.conn_id)
session.add(conn)
session.commit()
input JSON file structure
[
{
"db": "db_1",
"host": "db_1.hostname.com",
"user": "db_1_user",
"password": "db_1_passwd"
},
{
"db": "db_2",
"host": "db_2.hostname.com",
"user": "db_2_user",
"password": "db_2_passwd"
}
]
Reference links
With code, how do you update an airflow variable?
Problem updating the connections in Airflow programatically
How to create, update and delete airflow variables without using the GUI?
Programmatically clear the state of airflow task instances
airflow/api/common/experimental/pool.py
I created a migration and ran it. It says it worked fine, but nothing happened. I don't think it is even connecting to my database.
My Migration file:
var util = require("util");
module.exports = {
up : function(migration, DataTypes, done) {
migration.createTable('nameOfTheNewTable', {
attr1 : DataTypes.STRING,
attr2 : DataTypes.INTEGER,
attr3 : {
type : DataTypes.BOOLEAN,
defaultValue : false,
allowNull : false
}
}).success(
function() {
migration.describeTable('nameOfTheNewTable').success(
function(attributes) {
util.puts("nameOfTheNewTable Schema: "
+ JSON.stringify(attributes));
done();
});
});
},
down : function(migration, DataTypes, done) {
// logic for reverting the changes
}
};
My Config.json:
{
"development": {
"username": "user",
"password": "pw",
"database": "my-db",
"dialect" : "sqlite",
"host": "localhost"
}
}
The command:
./node_modules/sequelize/bin/sequelize --migrate --env development
Loaded configuration file "config/config.json".
Using environment "development".
Running migrations...
20130921234513-initial.js
nameOfTheNewTable Schema: {"attr1":{"type":"VARCHAR(255)","allowNull":true,"defaultValue":null},"attr2":{"type":"INTEGER","allowNull":true,"defaultValue":null},"attr3":{"type":"TINYINT(1)","allowNull":false,"defaultValue":false}}
Completed in 8ms
I can run this over and over and the output is always the same. I've tried it on a database which I know to have existing tables and try to describe those tables and still nothing happens.
Am I doing something wrong?
EDIT:
I'm pretty sure I'm not connecting to the db, but try as I might I cannot connect using the migration. I can connect using sqlite3 my-db.sqlite and run commands such as .tables to see tables I have created previously, but I cannot for the life of me get the "nameOfTheNewTable" table created using a migration. (I want to create indexes in the migration too). I have tried using "development", changing values in the config.json like the host, database (my-db, ../my-db, my-db.sqlite), etc.
Here's a good example, in the config.json I put "database" : "bad-db" and the output from the migration is exactly the same. When it is done, there is no bad-db.sqlite file to be found.
You need to specify the 'storage' parameter in your config.json, so that sequelize knows what file to use as the sqlite DB.
Sequelize defaults to using memory storage for sqlite, so it's migrating an in-memory database, then exiting, effectively destroying the DB it just migrated.
you most likely have to wait for migration.createTable to finish:
migration.createTable(/*your options*/).success(function() {
migration.describeTable('nameOfTheNewTable').success(function(attributes) {
util.puts("nameOfTheNewTable Schema: " + JSON.stringify(attributes));
done()
});
})