Create a dynamic database connection in Airflow DAG

I am using Apache Airflow 2.2.3 and I know we can create connections via Admin -> Connections, but I am looking for a way to create a connection from dynamic DB server details.
My DB host, user, and password details come in through the DAGRun input config, and I need to read and write data to that DB.

You can read connection details from the DAGRun config:
# Say we gave the input {"username": "foo", "password": "bar"}
from airflow.models.connection import Connection

def mytask(**context):
    username = context["dag_run"].conf["username"]
    password = context["dag_run"].conf["password"]
    connection = Connection(login=username, password=password)
However, every Airflow operator that requires a connection takes a conn_id argument: a string identifying a connection stored in the metastore, an environment variable, or a secrets backend. At the moment it is not possible to pass a Connection object directly.
Therefore, if you implement your own Python functions (run with the PythonOperator or the @task decorator) or write your own operators, you can create a Connection object there and perform whatever logic you need with it, as in the sketch below. Using the dynamic details with other existing Airflow operators is not possible.
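A minimal sketch of that approach, assuming the DAGRun conf also carries "host" and "dbname" keys and that the target is MySQL with a driver such as mysqlclient installed (all assumptions); Connection.get_uri() hands the details to SQLAlchemy:

# Minimal sketch: build an in-memory Connection from the DAGRun conf inside a task.
# The "host"/"dbname" conf keys and the mysql conn_type are assumptions.
from airflow.decorators import task
from airflow.models.connection import Connection
from sqlalchemy import create_engine, text

@task
def write_to_db(**context):
    conf = context["dag_run"].conf
    connection = Connection(
        conn_type="mysql",
        host=conf["host"],
        login=conf["username"],
        password=conf["password"],
        schema=conf["dbname"],
    )
    # get_uri() renders a SQLAlchemy-style URI from the fields above
    engine = create_engine(connection.get_uri())
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))  # replace with your read/write logic

The same idea works inside a plain function passed to the PythonOperator; the key point is that the Connection object never has to exist in the metastore.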

Related

Provide aws credentials to Airflow GreatExpectationsOperator

I would like to use GreatExpectationsOperator to perform data quality validations.
The validation results data should be stored in S3.
I don't see an option to pass an Airflow connection name to the GE operator, and the AWS credentials in my organization are stored in an Airflow connection.
How can Great Expectations retrieve S3 credentials from an Airflow connection rather than from the default AWS credentials in the .aws directory?
Thanks!
We ended up creating a new operator that inherits from the GE operator; it fetches the connection as part of its execute method.
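For reference, a rough sketch of that subclassing idea; the GreatExpectationsOperator import path, the aws_conn_id parameter, and the trick of exporting the resolved credentials as environment variables before calling the parent execute() are assumptions, not the exact code we wrote:

import os

from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

class GEWithAirflowConnOperator(GreatExpectationsOperator):
    def __init__(self, aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # Resolve credentials from the Airflow connection and expose them to
        # boto3 via environment variables before GE writes results to S3.
        creds = AwsBaseHook(aws_conn_id=self.aws_conn_id).get_credentials()
        os.environ["AWS_ACCESS_KEY_ID"] = creds.access_key
        os.environ["AWS_SECRET_ACCESS_KEY"] = creds.secret_key
        if creds.token:
            os.environ["AWS_SESSION_TOKEN"] = creds.token
        return super().execute(context)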

Airflow incorrectly always picks the built in default credentials under Admin-> Connections

I am trying to connect to a MySQL database using MySqlHook. Under Admin -> Connections I have defined a new connection of type mysql with the name myappname_db. I have used this in my code as drupalHook = MySqlHook(conn_name_attr='myappname_db').
However, when I run the DAG locally, I see that it picks up the built-in default connection from Admin -> Connections, i.e. mysql_default, instead of myappname_db.
To rectify this, do I need to update any setting in airflow.cfg or any other config?
Thanks.
The MySqlHook uses an attribute named mysql_conn_id to store the connection id.
conn_name_attr refers to the name of that attribute (mysql_conn_id in this case), which the MySqlHook looks up dynamically, so passing conn_name_attr='myappname_db' does not set the connection id at all. Pass the id through mysql_conn_id instead, as sketched below.
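A minimal sketch of the fix (the provider import path assumes Airflow 2.x):

# Pass the connection id through mysql_conn_id; conn_name_attr is not meant
# to carry the id itself.
from airflow.providers.mysql.hooks.mysql import MySqlHook

drupalHook = MySqlHook(mysql_conn_id="myappname_db")
records = drupalHook.get_records("SELECT 1")  # now runs against myappname_db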

Airflow logs in s3 bucket

I would like to write the Airflow logs to S3. Following are the parameters that we need to set according to the docs:
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to read/write to S3 if the correct permissions are set on the IAM role attached to the instance?
Fair point, but I think it allows for more flexibility if Airflow is not running on AWS, or if you want to use a specific set of credentials rather than give the entire instance access. It may also have made the implementation easier, because the underlying code for writing logs to S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection id.
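For illustration, a hedged airflow.cfg sketch of those three settings; the bucket path and connection id are placeholders, and in Airflow 2.x they live under [logging] (in 1.10 they sat under [core]):

[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/dag-logs
remote_log_conn_id = aws_default

If the instance's IAM role already grants S3 access, the referenced connection can typically be created with no explicit credentials, in which case boto3 falls back to the instance profile.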

How to mask the credentials in the Airflow logs?

I want to make sure certain encrypted variables do not appear in the Airflow log.
I am passing AWS keys to an Exasol EXPORT SQL statement, and they get printed in the Airflow log.
Currently, this is not possible out of the box. You can, however, configure your own Python logger and use that class by changing the logging_config_class property in the airflow.cfg file.
Example here: Mask out sensitive information in python log
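As a rough illustration of that idea, a stdlib logging.Filter that redacts AWS-style keys; wiring it into Airflow would go through a custom logging_config_class, and the regex is only an example pattern:

import logging
import re

# Example pattern: AWS access key ids plus anything that looks like a secret key assignment
AWS_KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}|aws_secret_access_key\s*=\s*\S+")

class MaskAwsKeysFilter(logging.Filter):
    def filter(self, record):
        # Redact matches in the message before the record is written out
        record.msg = AWS_KEY_PATTERN.sub("***MASKED***", str(record.msg))
        return True  # keep the record, just with secrets removed

logger = logging.getLogger("exasol_export")
logger.addFilter(MaskAwsKeysFilter())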
Are the AWS keys sent as part of the data for the SQL export, or are they sent for the connection?
If they are sent for the connection, then hiding these credentials is possible: simply create a connection and export the data using that connection.

SQLAlchemy Connections, pooling, and SQLite

So, my design calls for a separate SQLite file for each "project". I am reading through the SQLAlchemy pooling docs more carefully; my guess right now is that I don't want to fool with pooling at all, but rather that this is really a separate connection engine for each project. Agree?
In that case, when I create the engine, do I either connect to a file named by convention, or create a new SQLite file and supply a schema template?
Ehm, what? Connection pools contain many connections to the same (database) server. It takes time to establish a new connection, so when there are many short-lived processes using the same database, it's handy to have a pool of already-established connections. A process can check out a connection, do its thing, and return it, without having to wait for a new connection to open.
In any case, all connections in a pool go to the same database, given by the URI passed to create_engine.
First, some vocabulary. SQLAlchemy describes schemas with MetaData objects, which contain objects representing tables and other database entities. A MetaData object can optionally be "bound" to an engine, which is the thing you are thinking of as a "pool."
To create a standard schema and use it across multiple databases, create one MetaData object and use it with several engines, each engine being a database you connect to. Here's an example; note that each of these SQLite engines connects to a different in-memory database, so connection1 and connection2 do not talk to the same database:
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

metadata = MetaData()
users_table = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

# Two engines, each pointing at a different in-memory SQLite database
connection1 = create_engine("sqlite:///:memory:")
connection2 = create_engine("sqlite:///:memory:")

# Create the same tables in both databases from the shared metadata
metadata.create_all(bind=connection1)
metadata.create_all(bind=connection2)

# Insert and query each database independently
with connection1.begin() as conn:
    conn.execute(users_table.insert().values(name="Mike"))
with connection2.begin() as conn:
    conn.execute(users_table.insert().values(name="Jim"))

with connection1.connect() as conn:
    print(conn.execute(users_table.select()).fetchall())  # [(1, 'Mike')]
with connection2.connect() as conn:
    print(conn.execute(users_table.select()).fetchall())  # [(1, 'Jim')]
As you can see, I connected to two SQLite databases and executed statements against each using a common schema stored in my metadata object. If I were you, I'd start by just using create_engine and not worry about pooling. When it comes time to optimize, you can tweak how databases are connected to via arguments to create_engine, as sketched below.
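When that optimization time comes, a hedged sketch of the kind of create_engine pooling arguments meant above (the file name is a placeholder; for file-based SQLite these mostly matter under multithreading):

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite:///project_a.db",
    poolclass=QueuePool,   # explicit pool implementation
    pool_size=5,           # connections kept open in the pool
    max_overflow=10,       # extra connections allowed under load
    pool_recycle=3600,     # recycle connections older than an hour
)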
