How can I log sql execution results in airflow? - airflow

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see if executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DB-API, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the SQL and not the results. I have also attempted to create a logging cursor, which produces the SQL, but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""
cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms

Let's assume you are writing an operator that uses the Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
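Putting those pieces together, a minimal sketch of a callable you could hand to a PythonOperator (the import path assumes Airflow 2.x with the Postgres provider, and the connection id redshift_default is hypothetical):

from airflow.providers.postgres.hooks.postgres import PostgresHook


def run_and_log(sql):
    # "redshift_default" is an assumed connection id; replace with your own.
    hook = PostgresHook(postgres_conn_id="redshift_default")
    conn = hook.get_conn()
    cur = conn.cursor()
    print(sql)  # shows up in the Airflow task log
    cur.execute(sql)
    print(cur.statusmessage)  # e.g. 'INSERT 0 912'
    if cur.description is not None:  # only statements that return rows
        for row in cur.fetchall():
            print(row)
    conn.commit()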

Ok, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command.
The key is to manage logging in context with the transactions executed on the server. To do this I had to explicitly set con.autocommit = False and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you read cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
import logging

import psycopg2 as psy

con = psy.connect(...)
con.autocommit = False
cur = con.cursor()
try:
    cur.execute(some_sql)  # some_sql holds your SQL string
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()  # autocommit is off, so commit explicitly
except Exception:
    con.rollback()
    raise
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure can be utilized, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to utilize things such as log objects, or how to use the connection's backend PID to retrieve additional information, please share.
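On the backend-PID idea: psycopg2 does expose the server's process id via get_backend_pid(), and on Postgres you can look that up in pg_stat_activity. A sketch continuing the snippet above; note this targets vanilla Postgres, and Redshift's system tables differ:

pid = con.get_backend_pid()  # server process id for this connection
with con.cursor() as cur:
    cur.execute(
        "SELECT state, query_start, query FROM pg_stat_activity WHERE pid = %s",
        (pid,),
    )
    logging.info(cur.fetchone())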

Related

Using Snowflake & Apache Airflow, how can I change DATABASE or SCHEMA for a connection using SqlSensor?

I'm new to Snowflake (and using it in Apache Airflow). I'm attempting to use the SqlSensor with it.
I've found that the Snowflake driver doesn't allow multi-statement SQL. The SnowflakeOperator appears to have worked around this by splitting on ; and executing the statements one at a time.
Snowflake is a valid connection type for SqlOperator, but I haven't noticed an SqlSensor specifically for Snowflake. The SqlOperator doesn't split commands the way the SnowflakeOperator does, so it seems like each query used for sensing must be established on the right database and schema from the get-go, i.e. no USE DATABASE type commands - otherwise it is multi-statement and the sensor fails.
Do I need to establish a separate airflow connection for each Schema and Database I might choose to sense? Or is there another way to specify the DATABASE, SCHEMA, etc at runtime using the SqlSensor?
SqlSensor is designed to accept a single query.
I assume that you are trying to run queries like:
USE DATABASE my_database;
USE SCHEMA my_schema;
SELECT ...
This doesn't work. For the moment there is no out-of-the-box solution; a PR is in progress to handle it by exposing the underlying hook of SqlSensor, allowing parameters to be passed.
However, you can still solve your problem with one of the following:
Define another Snowflake connection whose properties you will not need to change in the query, so that you run only a single SELECT statement.
Create a custom SnowflakeSensor.
I didn't test it but the general idea is something like
from airflow.sensors.sql import SqlSensor
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


class SnowflakeSensor(SqlSensor):
    def __init__(self, **kwargs):
        # Pull the Snowflake-specific parameters out before handing the rest to SqlSensor.
        self.account = kwargs.pop("account", None)
        self.warehouse = kwargs.pop("warehouse", None)
        self.database = kwargs.pop("database", None)
        self.region = kwargs.pop("region", None)
        self.role = kwargs.pop("role", None)
        self.schema = kwargs.pop("schema", None)
        self.authenticator = kwargs.pop("authenticator", None)
        self.session_parameters = kwargs.pop("session_parameters", None)
        super().__init__(**kwargs)

    def _get_hook(self):
        return SnowflakeHook(
            snowflake_conn_id=self.conn_id,
            warehouse=self.warehouse,
            database=self.database,
            role=self.role,
            schema=self.schema,
            authenticator=self.authenticator,
            session_parameters=self.session_parameters,
        )
This code does everything that SqlSensor does, with the difference that it also accepts Snowflake-specific parameters, by overriding the _get_hook function.
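For illustration, wiring the sensor into a DAG might look like the following (untested, like the sensor above; the connection id, database, schema, and sensing query are all hypothetical):

from datetime import datetime

from airflow import DAG

with DAG(
    "snowflake_sensor_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    wait_for_rows = SnowflakeSensor(
        task_id="wait_for_rows",
        conn_id="snowflake_default",          # assumed connection id
        database="my_database",               # hypothetical
        schema="my_schema",                   # hypothetical
        sql="SELECT COUNT(*) FROM my_table",  # sensor succeeds on a truthy first cell
        poke_interval=60,
    )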

How can my Flask app check whether a SQLite3 transaction is in progress?

I am trying to build some smart error messages using the @app.errorhandler(500) feature. For example, my route includes an INSERT command to the database:
if request.method == "POST":
    userID = int(request.form.get("userID"))
    topicID = int(request.form.get("topicID"))
    db.execute("BEGIN TRANSACTION")
    db.execute("INSERT INTO UserToTopic (userID,topicID) VALUES (?,?)", userID, topicID)
    db.execute("COMMIT")
If that transaction violates a constraint, such as UNIQUE or FOREIGN KEY, I want to catch the error and display a user-friendly message. To do this, I'm using the Flask @app.errorhandler as follows:
@app.errorhandler(500)
def internal_error(error):
    db.execute("ROLLBACK")
    return render_template('500.html'), 500
The "ROLLBACK" command works fine if I'm in the middle of a database transaction. But sometimes the 500 error is not related to the db, and in those cases the ROLLBACK statement itself causes an error, because you can't rollback a transaction that never started. So I'm looking for a method that returns a Boolean value that would be true if a db transaction is under way, and false if not, so I can use it to make the ROLLBACK conditional. The only one I can find in the SQLite3 documentation is for a C interface, and I can't get it to work with my Python code. Any suggestions?
I know that if I'm careful enough with my forms and routes, I can prevent 99% of potential violations of db rules. But I would still like a smart error catcher to protect me for the other 1%.
I don't know how transactions work in sqlite, but what you are trying to do can be achieved with try/except statements.
Use try/except within the error handler:
@app.errorhandler(500)
def internal_error(error):
    try:
        db.execute("ROLLBACK")
    except Exception:
        pass  # no transaction was in progress, so there is nothing to roll back
    return render_template('500.html'), 500
Use try/except when inserting data.
from flask import abort

try:
    userID = int(request.form.get("userID"))
    [...]
except Exception:
    db.rollback()
    abort(500)
I am not familiar with sqlite errors; if you know which specific error occurs, catch that specific error instead of using a bare except.
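For what it's worth, constraint violations (UNIQUE, FOREIGN KEY) surface as sqlite3.IntegrityError in Python's sqlite3 module. A minimal sketch, assuming conn is a plain sqlite3 connection rather than the db wrapper used above:

import sqlite3

from flask import abort

try:
    conn.execute(
        "INSERT INTO UserToTopic (userID, topicID) VALUES (?, ?)",
        (userID, topicID),
    )
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # only constraint violations land here; other errors propagate
    abort(500)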

MssqlHook airflow connection

I am new to Airflow and I need to use MsSqlHook, but I do not know how. What arguments should I give the constructor?
I have a connection in airflow with name connection_test.
I do not fully understand the attributes in the class:
class MsSqlHook(DbApiHook):
    """
    Interact with Microsoft SQL Server.
    """

    conn_name_attr = 'mssql_conn_id'
    default_conn_name = 'mssql_default'
    supports_autocommit = True
I have the following code:
sqlhook=MsSqlHook(connection_test)
sqlhook.get_conn()
And when I do this the error is Connection failed for unknown reason.
What should I do in order to make it work with the Airflow connection?
What I need is to call the .get_conn() function of the MsSqlHook.
See the standard examples of Airflow.
https://github.com/gtoonstra/etl-with-airflow/blob/master/examples/mssql-example/dags/mssql_bcp_example.py
E.g.:
t1 = MsSqlImportOperator(task_id='import_data',
                         table_name='test.test',
                         generate_synth_data=generate_synth_data,
                         mssql_conn_id='mssql',
                         dag=dag)
EDIT
hook = MsSqlHook(mssql_conn_id="my_mssql_conn")
hook.run(sql)
You need to provide the name of a connection defined in Connections. Also, when using hooks, looking at the respective operators usually yields some information about usage; the snippet above is from the MsSqlOperator.
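Applied to the question, a minimal sketch using the connection named connection_test (the import path assumes Airflow 2.x with the Microsoft SQL Server provider; on older versions the hook lived in airflow.hooks.mssql_hook, and some_table is a hypothetical table):

from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

hook = MsSqlHook(mssql_conn_id="connection_test")
conn = hook.get_conn()  # raw DB-API connection, as asked
rows = hook.get_records("SELECT TOP 5 * FROM some_table")  # convenience helper on the hook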

What does "rerun with upstream in pipeline" do?

I am defining a pipeline in Data Factory. I had some errors that I corrected.
The first activity calls a U-SQL script to do some aggregation. I changed the script plenty of times, but the error is still:
[{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC","source":"USER","message":"syntax
error. Final statement did not end with a semicolon","details":"at
token 'usql', line 4\r\nnear the ###:\r\n**************\r\nCLARE
#lineitemsfile string =
\"/datalakerepo/input/2016/01/01lineitems.txt\";\nDECLARE #ordersfile
string = \"/datalakerepo/input/2016/01/01orders.txt\";\nsales.usql ###
\n","description":"Invalid syntax found in the
script.","resolution":"Correct the script syntax, using expected
token(s) as a
guide.","helpLink":"","filePath":"","lineNumber":4,"startOffset":228,"endOffset":232}].
It seems like not all of the U-SQL script is read by Data Factory, so I thought that maybe "rerun with upstream in pipeline" has something to do with this, like clearing the cache from a previous script.
Anyone knows what "rerun in upstream in pipeline" does?
Many thanks!
"Rerun with upstream in pipeline" basically means "recalculate with all dependencies". For example, if one has pipeline1 -> dataset1 -> pipeline2 and tries to rerun pipeline2 with dependecies, then pipeline1 and pipeline2 will be both executed. I believe it works same with several chained activities within single pipeline.

What's wrong with this ASP recursive function?

When I call this function, everything works, as long as I don't try to recursively call the function again. In other words if I uncomment the line:
GetChilds rsData("AcctID"), intLevel + 1
Then the function breaks.
<%
Function GetChilds(ParentID, intLevel)
    Set rsData = Server.CreateObject("ADODB.Recordset")
    sSQL = "SELECT AcctID, ParentID FROM Accounts WHERE ParentID='" & ParentID & "'"
    rsData.Open sSQL, conDB, adOpenKeyset, adLockOptimistic

    If IsRSEmpty(rsData) Then
        Response.Write("Empty")
    Else
        Do Until rsData.EOF
            Response.Write rsData("AcctID") & "<br />"
            'GetChilds rsData("AcctID"), intLevel + 1
            rsData.MoveNext
        Loop
    End If

    rsData.Close: Set rsData = Nothing
End Function

Call GetChilds(1,0)
%>
*Edited after feedback
Thanks everyone,
Other than the usual error:
Error Type: (0x80020009) Exception occurred.
I wasn't sure what was causing the problems. I understand that is probably due to a couple of factors.
Not closing the connection and attempting to re-open the same connection.
Too many concurrent connections to the database.
The database content is as follows:
AcctID | ParentID
1      | Null
2      | 1
3      | 1
4      | 2
5      | 2
6      | 3
7      | 4
The idea is that I can have a Master Account with Child Accounts, and those Child Accounts can have Child Accounts of their own. Eventually there will be another Master Account with a ParentID of Null that will have children of its own. With that in mind, am I going about this the correct way?
Thanks for the quick responses.
Looks like it fails because your connection is still busy serving the RecordSet from the previous call.
One option is to use a fresh connection for each call. The danger there is that you'll quickly run out of connections if you recurse too many times.
Another option is to read the contents of each RecordSet into a disconnected collection: (Dictionary, Array, etc) so you can close the connection right away. Then iterate over the disconnected collection.
If you're using SQL Server 2005 or later there's an even better option. You can use a CTE (common table expression) to write a recursive sql query. Then you can move everything to the database and you only need to execute one query.
Some other notes:
ID fields are normally ints, so you shouldn't encase them in ' characters in the sql string.
Finally, this code is probably okay because I doubt the user is allowed to input an id number directly. However, the dynamic sql technique used is very dangerous and should generally be avoided; use query parameters instead to prevent sql injection (see the sketch after these notes).
I'm not too worried about not using intLevel for anything. Looking at the code this is obviously an early version, and intLevel can be used later to determine something like indentation or the class name used when styling an element.
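Classic ASP would do this with an ADODB.Command and CreateParameter; as a language-neutral illustration of the idea, here is the same query parameterized in Python with psycopg2, which appears earlier on this page (the connection string is hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=accounts")  # hypothetical DSN
cur = conn.cursor()
parent_id = 1
# The driver sends the value separately from the SQL text, so a hostile
# parent_id cannot change the statement itself.
cur.execute(
    "SELECT AcctID, ParentID FROM Accounts WHERE ParentID = %s",
    (parent_id,),
)
rows = cur.fetchall()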
Running out of SQL Connections?
You are dealing with so many layers there (Response.Write for the client, the ASP for the server, and the database) that it's not surprising that there are problems.
Perhaps you can post some details about the error?
Hard to tell without more description of how it breaks, but you are not using intLevel for anything.
How does it break?
My guess is that after a certain number of recursions you're probably getting a Stack Overflow (ironic) because you're allocating too many RecordSets.
In each call you open a new connection to the database and you don't close it before opening a new one.
Not that this is actually a solution to the recursion issue, but it might be better for you to work out an SQL statement that returns all the information in a hierarchical format, rather than making recursive calls to your database.
Come to think of it though, it may be because you have too many concurrent db connections. You continually open them, but aren't going to start closing until you're pulling out of your recursive loop.
Try declaring the variables as local using a Dim statement within the function definition:
Function GetChilds(ParentID, intLevel)
    Dim rsData, sSQL
    Set ...
Edit: OK, I'll try to be more explicit.
My understanding is that since rsData is not declared with Dim, it is not a local variable but a global one. Therefore, as you loop through the WHILE statement, you reach the .EOF of the inner-most rsData recordset, return from the recursive function call, and the next step is again rsData.MoveNext, which fails.
Please correct me if rsData is indeed local.
If you need recursion such as this, I would personally put the recursion into a stored procedure and handle that processing on the database side in order to avoid opening multiple connections. If you are using MSSQL 2005, look into something called Common Table Expressions (CTEs); they make recursion easy. There are other ways to implement recursion with other RDBMSs.
Based on the suggestions I will attempt to move the query into a CTE (common table expression) when I find a good tutorial on how to do that. For now, as a quick and dirty fix, I have changed the code as follows:
Function GetChilds(ParentID, intLevel)
    'Open my database connection and query the current Parent ID
    Set rsData = Server.CreateObject("ADODB.Recordset")
    sSQL = "SELECT AcctID, ParentID FROM Accounts WHERE ParentID='" & ParentID & "'"
    rsData.Open sSQL, conDB, adOpenKeyset, adLockOptimistic

    'If the record set is not empty, continue
    If Not IsRSEmpty(rsData) Then
        Dim myAccts()
        ReDim myAccts(rsData.RecordCount - 1) 'zero-based: RecordCount items use indexes 0..RecordCount-1
        Dim i
        i = 0
        Do Until rsData.EOF
            Response.Write "Account ID: " & rsData("AcctID") & " ParentID: " & rsData("ParentID") & "<br />"
            'Add the childs of the current Parent ID to an array.
            myAccts(i) = rsData("AcctID")
            i = i + 1
            rsData.MoveNext
        Loop

        'Close the recordset before recursing. (I know not the best way, but hey, I am just learning this stuff.)
        rsData.Close: Set rsData = Nothing

        'For each child found in the previous query, now get their childs.
        For i = 0 To UBound(myAccts)
            Call GetChilds(myAccts(i), intLevel + 1)
        Next
    End If
End Function

Call GetChilds(1,0)
I have working code with the same scenario.
I use a client-side cursor:
...
rsData.CursorLocation = adUseClient
rsData.Open sSQL, conDB, adOpenKeyset, adLockOptimistic
Set rsData.ActiveConnection = Nothing
...
As pointed out in other responses, this is not very efficient; I use it only in an admin interface where the code is called infrequently and speed is not as critical.
I would not use such a recursive process in a regular web page.
Either rework the code to get all data in one call from the database, or make the call once and save it to a local array and save the array in an application variable.
