MssqlHook airflow connection

I am new to Airflow and I need to use MsSqlHook, but I do not know how. What arguments should I pass to the constructor?
I have an Airflow connection named connection_test.
I do not fully understand the attributes in the class:
class MsSqlHook(DbApiHook):
    """
    Interact with Microsoft SQL Server.
    """

    conn_name_attr = 'mssql_conn_id'
    default_conn_name = 'mssql_default'
    supports_autocommit = True
I have the following code:
sqlhook=MsSqlHook(connection_test)
sqlhook.get_conn()
When I run this, the error is Connection failed for unknown reason.
What should I do to make it work with the Airflow connection?
What I need is to call .get_conn() on the MsSqlHook.

See the standard Airflow examples:
https://github.com/gtoonstra/etl-with-airflow/blob/master/examples/mssql-example/dags/mssql_bcp_example.py
E.g.:
t1 = MsSqlImportOperator(
    task_id='import_data',
    table_name='test.test',
    generate_synth_data=generate_synth_data,
    mssql_conn_id='mssql',
    dag=dag,
)
EDIT
hook = MsSqlHook(mssql_conn_id="my_mssql_conn")
hook.run(sql)
You need to provide the ID of a connection defined under Admin -> Connections. Also, when using hooks, looking at the respective operators usually yields some information about their usage; the snippet above is taken from the MsSqlOperator.
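To address the get_conn() part of the question directly: pass the connection ID to the hook as a keyword argument. A minimal, untested sketch, assuming the connection_test connection exists in Admin -> Connections (the import path is for the Microsoft provider package and may differ in older Airflow versions, e.g. airflow.hooks.mssql_hook):
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

hook = MsSqlHook(mssql_conn_id="connection_test")

conn = hook.get_conn()   # returns a DB-API (pymssql) connection object
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())
conn.close()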

Related

How to specify which GCP project to use when triggering a pipeline through Data Fusion operator on Cloud Composer

I need to trigger a Data Fusion pipeline located on a GCP project called myDataFusionProject through a Data Fusion operator (CloudDataFusionStartPipelineOperator) inside a DAG whose Cloud Composer instance is located on another project called myCloudComposerProject.
I have used the official documentation as well as the source code to write the code that roughly resembles the below snippet:
LOCATION = "someLocation"
PIPELINE_NAME = "myDataFusionPipeline"
INSTANCE_NAME = "myDataFusionInstance"
RUNTIME_ARGS = {
    "output.instance": "someOutputInstance",
    "input.dataset": "someInputDataset",
    "input.project": "someInputProject",
}

start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    runtime_args=RUNTIME_ARGS,
    task_id="start_pipeline",
)
My issue is that, every time I trigger the DAG, Cloud Composer looks for myDataFusionInstance inside myCloudComposerProject instead of myDataFusionProject, which gives an error like this one:
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://datafusion.googleapis.com/v1beta1/projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance?alt=json returned "Resource 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance' was not found". Details: "[{'#type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance'}]"
So the question is: how can I force my operator to use the Data Fusion project instead of the Cloud Composer project? I suspect I may do that by adding a new runtime argument but I'm not sure how to do that.
Last piece of information: the Data Fusion pipeline simply extracts data from a BigQuery source and sends everything to a BigTable sink.
As a recommendation when developing with operators in Airflow, check the classes that implement the operators, since the documentation may lack some information due to versioning.
As commented, if you check CloudDataFusionStartPipelineOperator you will find that it uses a hook that gets the instance based on a project_id. This project_id is optional, so you can pass your own.
class CloudDataFusionStartPipelineOperator(BaseOperator):
    ...
    def __init__(
        ...
        project_id: Optional[str] = None,   # NOT MENTIONED IN THE DOCUMENTATION
        ...
    ) -> None:
        ...
        self.project_id = project_id
        ...

    def execute(self, context: dict) -> str:
        ...
        instance = hook.get_instance(
            instance_name=self.instance_name,
            location=self.location,
            project_id=self.project_id,     # falls back to the default project ID when not set
        )
        api_url = instance["apiEndpoint"]
        ...
Adding the parameter to your operator call should fix your issue.
start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    runtime_args=RUNTIME_ARGS,
    project_id=PROJECT_ID,
    task_id="start_pipeline",
)
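Here PROJECT_ID should point to the project that hosts the Data Fusion instance, i.e. something like PROJECT_ID = "myDataFusionProject" in this example.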
As a final note, besides the official documentation site you can also explore the Apache Airflow source files on GitHub.

Using Snowflake & Apache Airflow, how can I change DATABASE or SCHEMA for a connection using SqlSensor?

I'm new to Snowflake (and using it in Apache Airflow). I'm attempting to use the SqlSensor with it.
I've found that the Snowflake driver doesn't allow multi-statement SQL. The SnowflakeOperator appears to have worked around this by splitting on ; and executing the statements one at a time.
Snowflake is a valid connection type for SqlOperator, but I haven't noticed an SqlSensor specifically for Snowflake. The SqlOperator doesn't split commands the way the SnowflakeOperator does. Therefore, it seems like each query used for sensing must already be on the right database and schema from the get-go, i.e. no USE DATABASE-type commands; otherwise it becomes a multi-statement query and the sensor fails.
Do I need to establish a separate Airflow connection for each schema and database I might want to sense? Or is there another way to specify the DATABASE, SCHEMA, etc. at runtime using the SqlSensor?
SqlSensor is designed to accept a single query.
I assume that you are trying to run queries like:
USE DATABASE my_database;
USE SCHEMA my_schema;
SELECT ...
This doesn't work. For the moment there is no out-of-the-box solution. A PR is in progress to handle it by exposing the underlying hook of SqlSensor, allowing parameters to be passed.
However, you can still solve your problem with one of the following:
- Define another Snowflake connection whose properties you will not need to change in the query, so that you run only a single SELECT statement.
- Create a custom SnowflakeSensor.
I didn't test it, but the general idea is something like:
from airflow.sensors.sql import SqlSensor
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


class SnowflakeSensor(SqlSensor):
    def __init__(self, **kwargs):
        self.account = kwargs.pop("account", None)
        self.warehouse = kwargs.pop("warehouse", None)
        self.database = kwargs.pop("database", None)
        self.region = kwargs.pop("region", None)
        self.role = kwargs.pop("role", None)
        self.schema = kwargs.pop("schema", None)
        self.authenticator = kwargs.pop("authenticator", None)
        self.session_parameters = kwargs.pop("session_parameters", None)
        super().__init__(**kwargs)

    def _get_hook(self):
        return SnowflakeHook(
            snowflake_conn_id=self.conn_id,
            warehouse=self.warehouse,
            database=self.database,
            role=self.role,
            schema=self.schema,
            authenticator=self.authenticator,
            session_parameters=self.session_parameters,
        )
This code does everything that SqlSensor does, with the difference that it also accepts Snowflake-specific parameters by overriding the _get_hook function.
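Also untested, but usage inside a DAG could then look something like the sketch below; the connection ID, database, schema and query are placeholders:
wait_for_rows = SnowflakeSensor(
    task_id="wait_for_rows",
    conn_id="snowflake_default",   # placeholder Snowflake connection ID
    database="my_database",
    schema="my_schema",
    sql="SELECT COUNT(*) FROM my_table WHERE load_date = CURRENT_DATE",
    poke_interval=300,             # check every 5 minutes
    timeout=60 * 60,               # give up after an hour
    dag=dag,
)
The sensor keeps poking until the first cell of the first returned row is truthy, so a COUNT(*) of 0 keeps it waiting.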

The conn_id isn't defined

I'm learning Airflow and I'm trying to understand how connections work.
I have a first dag with the following code:
c = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxx',
    password='xxxxxxxxx'
)

def list_keys():
    hook = S3Hook(aws_conn_id=c.conn_id)
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")
In this case it works fine: the connection is passed correctly to the S3Hook.
Then I have a second dag:
redshift_connection = Connection(
    conn_id='redshift',
    conn_type='postgres',
    login='duser',
    password='xxxxxxxxxx',
    host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    schema='db'
)

aws_connection = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxxx',
    password='xxxxxxxx'
)

def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook(aws_connection.conn_id)
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook(redshift_connection.conn_id)
    sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
    redshift_hook.run(sql_stmnt)

dag = DAG(
    's3_to_Redshift',
    start_date=datetime.datetime.now()
)

create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id=redshift_connection.conn_id,
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
    dag=dag
)
This DAG returns the following error: The conn_id `redshift` isn't defined
Why is that? What are the differences between my first and second DAG? Why does the connection seem to work in the first example and not in the second?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook, it successfully retrieves the credentials (from Airflow's database backend, not from the Connection object that you created).
But you did not create a connection with the ID redshift, therefore the hook complains that it is not defined. You have to create the connection as described in the documentation first.
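For example, the missing connection could be created from the CLI along the following lines (values are placeholders; the exact flag spelling varies between Airflow versions, the form below is for Airflow 2.x):
airflow connections add 'redshift' \
    --conn-type 'postgres' \
    --conn-login 'duser' \
    --conn-password '<password>' \
    --conn-host 'xxxxxxxx.us-west-2.redshift.amazonaws.com' \
    --conn-port 5439 \
    --conn-schema 'db'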
Note: The reason for not defining connections in the DAG code is that the DAG code is usually stored in a version control system (e.g., Git). And it would be a security risk to store credentials there.

How can I log sql execution results in airflow?

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see when executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DB-API, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the SQL and not the results. I have also attempted to create a logging connection, which produces the SQL but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()

sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""

cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses the Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
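Putting this together, a minimal, untested sketch of a task callable that logs both the statement and its results through a hook; the connection ID, table name and import path are placeholders/assumptions for a recent Airflow with the Postgres provider installed:
import logging

from airflow.providers.postgres.hooks.postgres import PostgresHook


def log_query_results(**context):
    hook = PostgresHook(postgres_conn_id="redshift_default")  # placeholder connection ID
    sql = "SELECT COUNT(*) FROM mytable"                      # placeholder query
    logging.info("Running: %s", sql)
    rows = hook.get_records(sql)   # fetch all result rows through the hook
    for row in rows:
        logging.info("Result row: %s", row)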
OK, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in context with transactions executed on the server. In order for me to do this, I had to specifically set con.autocommit = False, and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you insert cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is a much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
con = psy.connect(...)
con.autocommit = False
cur = con.cursor()

try:
    cur.execute(some_sql)
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()   # autocommit is off, so commit explicitly
except:
    con.rollback()
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure can be utilized, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to use things such as log objects, or joining on the PID to retrieve additional information, they would be welcome.

Airflow Custom Metrics and/or Result Object with custom fields

While running pySpark SQL pipelines via Airflow I am interested in getting out some business stats like:
- source read count
- target write count
- sizes of DFs during processing
- error records count
One idea is to push them directly to the metrics, so they get automatically consumed by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about that in the docs.
Please post at least some pseudocode if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"


def your_python_operator(**context):
    [...]

    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except:
        logging.exception("Caught exception during statistics logging")

    [...]
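For the counters to actually reach a monitoring backend, StatsD emission has to be enabled in the Airflow configuration. An illustrative airflow.cfg snippet (the section is [metrics] in Airflow 2.x; older releases kept these options under [scheduler]), which something like statsd_exporter can then expose to Prometheus:
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow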
