I'm learning Airflow and am planning to set some variables to use across different tasks. These are in my dags folder, saved as configs.json, like so:
{
"vars": {
"task1_args": {
"something": "This is task 1"
},
"task2_args": {
"something": "this is task 2"
}
}
}
I get that we can enter Admin-->Variables--> upload the file. But I have 2 questions:
What if I want to adjust some of the variables while airflow is running? I can adjust my code easily and it updates in realtime but it doesn't seem like this works for variables.
Is there a way to just auto-import this specific file on startup? I don't want to have to add it every time I'm testing my project.
I don't see this mentioned in the docs but it seems like a pretty trivial thing to want.
What you are looking for is With code, how do you update an airflow variable?
Here's an untested snippet that should help
from airflow.models import Variable
Variable.set(key="my_key", value="my_value")
So basically you can write a bootstrap python script to do this setup for you.
In our team, we use such scripts to setup all Connections, and Pools too
In case you are wondering, here's the set(..) method from source
#classmethod
#provide_session
def set(
cls,
key: str,
value: Any,
serialize_json: bool = False,
session: Session = None
):
"""
Sets a value for an Airflow Variable with a given Key
:param key: Variable Key
:param value: Value to set for the Variable
:param serialize_json: Serialize the value to a JSON string
:param session: SQL Alchemy Sessions
"""
if serialize_json:
stored_value = json.dumps(value, indent=2)
else:
stored_value = str(value)
Variable.delete(key, session=session)
session.add(Variable(key=key, val=stored_value))
session.flush()
Related
Airflow version :2.0.2
Trying to create Emr cluster, by retrying data from AWS secrets manager.
I am trying to write an airflow dag and, my task is to get data from this get_secret function and use it in Spark_steps
def get_secret():
secret_name = Variable.get("secret_name")
region_name = Variable.get(region_name)
# Create a Secrets Manager client
session = boto3.session.Session()
client = session.client(service_name='secretsmanager', region_name=region_name)
account_id = boto3.client('sts').get_caller_identity().get('Account')
try:
get_secret_value_response = client.get_secret_value(SecretId=secret_name)
if 'SecretString' in get_secret_value_response:
secret_str = get_secret_value_response['SecretString']
secret=json.loads(secret_str)
airflow_path=secret["airflow_path"]
return airflow_path
...
I need to use "airflow_path" return value in below spark_steps
Spark_steps:
SPARK_STEPS = [
{
'Name': 'Spark-Submit Command',
"ActionOnFailure": "CONTINUE",
'HadoopJarStep': {
"Jar": "command-runner.jar",
"Args": [
'spark-submit',
'--py-files',
's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
's3://'+airflow_path+'-pyspark/pitchbook/main.py'
],
},
},
I saw on the internet I need to use Xcom, is this right ?, and do I need to run this function in python operator first and then get the value. please provide an example as I am a newbie.
Thanks for your help.
Xi
Yes if you would like to pass dynamic stuff, leveraging xcom push/pull might be easier.
Leverage PythonOperator to push data into xcom.
See reference implementation:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
I'm using Airflow in Google Composer. By default all the tasks in a DAG use the default connection to communicate with Storage, BigQuery, etc. Obviously we can specify another connection configured in Airflow, ie:
task_custom = bigquery.BigQueryInsertJobOperator(
task_id='task_custom_connection',
gcp_conn_id='my_gcp_connection',
configuration={
"query": {
"query": 'SELECT 1',
"useLegacySql": False
}
}
)
Is it possible to use a specific connection as the default for all tasks in the entire DAG?
Thanks in advance.
UPDATE:
Specify gcp_conn_id via default_args in DAG (as Javier Lopez Tomas recommended) doesn't work completely. The Operators that expect gcp_conn_id as parameter works fine, but in my case unfortunately most of interactions with GCP components do so via clients or hooks within PythonOperators.
For example: If I call DataflowHook (inside a function called by a PythonOperator) without specifying the connection, it internally uses "google_cloud_default" and not "gcp_conn_id" :(
def _dummy_func(**context):
df_hook = DataflowHook()
default_args = {
'gcp_conn_id': 'my_gcp_connection'
}
with DAG(default_args=default_args) as dag:
dummy = PythonOperator(python_callable=_dummy_func)
You can use default args:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#default-arguments
In your case it would be:
default_args = {
"gcp_conn_id": "my_gcp_connection"
}
with DAG(blabla)...
Is it possible to pre-define variables, connections etc. in a file so that they are loaded when Airflow starts up? Setting them through the UI is not great from a deployment perspective.
Cheers
Terry
I'm glad that someone asked this question. In fact since Airflow completely exposes the underlying SQLAlchemy models to end-user, programmatic manipulation (creation, updation & deletion) of all Airflow models, particularly those used to supply configs like Connection & Variable is possible.
It may not be very obvious, but the open-source nature of Airflow means there are no secrets: you just need to peek in harder. Particularly for these use-cases, I've always found the cli.py to be very useful reference point.
So here's the snippet I use to create all MySQL connections while setting up Airflow. The input file supplied is of JSON format with the given structure.
# all imports
import json
from typing import List, Dict, Any, Optional
from airflow.models import Connection
from airflow.settings import Session
from airflow.utils.db import provide_session
from sqlalchemy.orm import exc
# trigger method
def create_mysql_conns(file_path: str) -> None:
"""
Reads MySQL connection settings from a given JSON file and
persists it in Airflow's meta-db. If connection for same
db already exists, it is overwritten
:param file_path: Path to JSON file containing MySQL connection settings
:type file_path: str
:return: None
:type: None
"""
with open(file_path) as json_file:
json_data: List[Dict[str, Any]] = json.load(json_file)
for settings_dict in json_data:
db_name: str = settings_dict["db"]
conn_id: str = "mysql.{db_name}".format(db_name=db_name)
mysql_conn: Connection = Connection(conn_id=conn_id,
conn_type="mysql",
host=settings_dict["host"],
login=settings_dict["user"],
password=settings_dict["password"],
schema=db_name,
port=settings_dict.get("port", mysql_conn_description["port"]))
create_and_overwrite_conn(conn=mysql_conn)
# utility delete method
#provide_session
def delete_conn_if_exists(conn_id: str, session: Optional[Session] = None) -> bool:
# Code snippet borrowed from airflow.bin.cli.connections(..)
try:
to_delete: Connection = (session
.query(Connection)
.filter(Connection.conn_id == conn_id)
.one())
except exc.NoResultFound:
return False
except exc.MultipleResultsFound:
return False
else:
session.delete(to_delete)
session.commit()
return True
# utility overwrite method
#provide_session
def create_and_overwrite_conn(conn: Connection, session: Optional[Session] = None) -> None:
delete_conn_if_exists(conn_id=conn.conn_id)
session.add(conn)
session.commit()
input JSON file structure
[
{
"db": "db_1",
"host": "db_1.hostname.com",
"user": "db_1_user",
"password": "db_1_passwd"
},
{
"db": "db_2",
"host": "db_2.hostname.com",
"user": "db_2_user",
"password": "db_2_passwd"
}
]
Reference links
With code, how do you update an airflow variable?
Problem updating the connections in Airflow programatically
How to create, update and delete airflow variables without using the GUI?
Programmatically clear the state of airflow task instances
airflow/api/common/experimental/pool.py
I currently have a value that is stored as an environment variable the environment where a jupyter server is running. I would like to somehow pass that value to a frontend extension. It does not have to read the environment variable in real time, I am fine with just using the value of the variable at startup. Is there a canonical way to pass parameters a frontend extension on startup? Would appreciate an examples of both setting the parameter from the backend and accessing it from the frontend.
[update]
I have posted a solution that works for nbextentions, but I can't seem to find the equivalent pattern for labextensions (typescript), any help there would be much appreciated.
I was able to do this by adding the following code to my jupter_notebook_config.py
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('notebook', {'variable_being_set': value})
Then I had the parameters defined in my extension in my main.js
// define default values for config parameters
var params = {
variable_being_set : 'default'
};
// to be called once config is loaded, this updates default config vals
// with the ones specified by the server's config file
var update_params = function() {
var config = Jupyter.notebook.config;
for (var key in params) {
if (config.data.hasOwnProperty(key) ){
params[key] = config.data[key];
}
}
};
I also have the parameters declared in my main.yaml
Parameters:
- name: variable_being_set
description: ...
input_type: text
default: `default_value`
This took some trial and error to find out because there is very little documentation on the ConfigManager class and none of it has an end-to-end example.
I have an Airflow DAG that starts an AWS EMR cluster to run steps. On the DAG we pass some variables that are set on Airflow Variables. But some of these variables are encrypted at Airflow, but when passing to EMR, we can see then clearly at EMR console. Is there any way to hide this?
Here is how we are defining the step. The airflow variable db_pass must be encrypted or hidden somehow
{
"Name": "EMR JOB",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"{{var.value.job_script}}",
"--database_user={{var.value.db_user}}",
"--database_pass={{var.value.db_pass}}"
]
}
}
]
This SAMPLE_STEP_DEFINITION is then passed as to the EmrAddStepsOperator:
...
sample_task = EmrAddStepsOperator(
steps=SAMPLE_STEP_DEFINITION,
...
There are several ways to do this. First I would suggest to encrypt passwords with KMS. Here is the code how to do it:
def encryptString(plainText: String, keyArn: String): String = {
val req = new EncryptRequest().withKeyId(keyArn).withPlaintext(ByteBuffer.wrap(plainText.getBytes))
Base64.getEncoder.encodeToString(kmsClient.encrypt(req).getCiphertextBlob.array())
}
def decryptString(encryptedText: String, keyArn: String): String = {
val req = new DecryptRequest().withCiphertextBlob(ByteBuffer.wrap(Base64.getDecoder.decode(encryptedText)))
new String(kmsClient.decrypt(req).getPlaintext.array())
}
Your just need to attach decrypt permission to EMR_EC2_DefaultRole.
Another way is to pass a config file stored on S3 with password.