How to correctly unit test (using nose) a sqlalchemy Model by creating a new database - sqlite

I currently have an application using Flask-SQLAlchemy. My model is connected to a PostgreSQL database, and now I would like to write unit tests (using nose). I was told to use SQLite to create a new database for testing, and after a lot of searching (and looking at the testing section of the Flask-SQLAlchemy website) I'm still confused about how to do it. Each class in my model.py looks something like the following:
db = SQLAlchemy(app)

class Prod(db.Model):
    __tablename__ = 'prod'

    id = db.Column(db.Integer, primary_key=True)
    desc = db.Column(db.String)

    def __init__(self, id, desc):
        self.id = id
        self.desc = desc
My config.py:
app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'postgres://name:pass@server/db'
and I would like to test my insert functions in a new file by setting up and tearing down a new database for each test. If anyone can give me some example code that would be great. Thanks!

I can't answer your specific question, but I will offer some general advice:
You will find that setting up and tearing down the complete database for each test is too slow. Imagine a future where you have hundreds or even thousands of tests.
The approach we take is:
For testing purposes we have a database populated with test data. We have a script which creates a fresh database and populates it with this test data.
We run this script prior to running our test suite. All tests can assume this data exists.
Each test may create additional records if necessary, but it is its responsibility to undo any changes it makes (delete new records, revert updates) - in other words, to leave the database in the same state it was in before the test began. This prevents tests from interfering with each other.
In a project I manage we have a test suite of 1070 tests which runs in about 5 minutes using this approach.
What if we had taken your approach? Let's assume that 50% of these tests actually exercise the database (and need a fresh reload). That's 1070 * .50 * 20 seconds for the reload / 3600 = 2.97 hours. Oops - that's far too slow to be useful.
Even at a much smaller scale though, you'll be much happier if your test suite runs in 1 minute instead of 20 minutes.
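For completeness, here is what the per-test setup/teardown pattern the question asks about could look like with nose and an in-memory SQLite database. This is only a minimal sketch, not the approach recommended above: the module paths myapp.config and myapp.model are illustrative, and it assumes the test configuration is applied before the engine is first used.
# Minimal sketch; module names are hypothetical, adjust to your project layout.
from myapp.config import app          # the Flask app created in config.py
from myapp.model import db, Prod      # db = SQLAlchemy(app) and the model


class TestProd(object):
    def setup(self):
        # Point the app at an in-memory SQLite database for this test.
        # This must happen before the engine is first used.
        app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite://'
        app.config['TESTING'] = True
        self.ctx = app.app_context()
        self.ctx.push()
        db.create_all()

    def teardown(self):
        db.session.remove()
        db.drop_all()
        self.ctx.pop()

    def test_insert(self):
        db.session.add(Prod(1, 'widget'))
        db.session.commit()
        assert Prod.query.get(1).desc == 'widget'
nose runs setup and teardown around each test method, so every test starts from a freshly created, empty schema.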

Related

Is it possible to create dynamic jobs with Dagster?

Consider this example - you need to load table1 from a source database, do some generic transformations (like converting time zones for timestamped columns) and write the resulting data into Snowflake. This is an easy one and can be implemented using three Dagster ops.
Now, imagine you need to do the same thing but with 100s of tables. How would you do it with dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job, that will be executed 100 times? Can you throttle how many of these jobs will run at the same time?
You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
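A rough sketch of that dynamic-output approach, using Dagster's DynamicOut / DynamicOutput APIs (the table names and op bodies below are placeholders, not from the original answer):
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut(str))
def table_names():
    # In practice this list could come from config or a catalog query.
    for name in ["table1", "table2", "table3"]:
        yield DynamicOutput(name, mapping_key=name)

@op
def etl_one_table(table_name: str):
    # extract -> transform -> load for a single table goes here
    pass

@job
def fan_out_etl():
    # Runs etl_one_table once per yielded table name.
    table_names().map(etl_one_table)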
Create a configurable job (I think this is more likely what you want)
from dagster import job, op, graph
import pandas as pd

@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()

@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table

@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...

@job
def configurable_etl():
    load_table(transform_table(extract_table()))

# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options (which are provided through the run config when you create a run), your job will operate on different source/destination tables.
The example shows explicitly running this job using python APIs, but if you're running it from Dagit, you'll also be able to input the yaml version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.
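For example, a hedged sketch of the tagging side (the tag key/value and the limit are made up, and the dagster.yaml fragment in the comment follows the documented QueuedRunCoordinator structure):
from dagster import job, op

@op
def noop():
    # Placeholder op so the job has at least one step.
    pass

# A QueuedRunCoordinator configured in dagster.yaml can throttle runs that
# carry this tag, roughly like:
#
#   run_coordinator:
#     module: dagster.core.run_coordinator
#     class: QueuedRunCoordinator
#     config:
#       tag_concurrency_limits:
#         - key: "etl"
#           value: "table-load"
#           limit: 5

@job(tags={"etl": "table-load"})
def throttled_etl():
    noop()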

How can I log sql execution results in airflow?

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see if executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DBAPI, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the SQL and not the results. I have also attempted to create a logging cursor, which produces the SQL but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""
cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses the Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
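For illustration, a hedged sketch of that pattern inside a custom operator (the operator name is made up; the PostgresHook import path shown is the Airflow 2 provider package, older versions use airflow.hooks.postgres_hook):
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class LoggingSqlOperator(BaseOperator):
    """Illustrative operator: runs SQL and logs the statement, rows, and status."""

    def __init__(self, sql, postgres_conn_id="postgres_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        conn = hook.get_conn()
        cur = conn.cursor()

        self.log.info("Executing: %s", self.sql)
        cur.execute(self.sql)

        if cur.description:  # there is a result set (e.g. a SELECT)
            for row in cur.fetchall():
                self.log.info("%s", row)

        # psycopg2-specific, e.g. 'INSERT 0 912' after an INSERT
        self.log.info("Status: %s", cur.statusmessage)
        conn.commit()
Everything logged via self.log.info (or printed) ends up in the task log in the Airflow UI.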
Ok, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in context with transactions executed on the server. In order for me to do this, I had to specifically set con.autocommit = False, and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you insert cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is a much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
import logging

import psycopg2 as psy

con = psy.connect(...)
con.autocommit = False
cur = con.cursor()

try:
    cur.execute([some_sql])
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()  # autocommit is False, so commit explicitly
except:
    con.rollback()
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure can be utilized, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to use things such as log objects, or how to use the backend PID to retrieve additional information, please share.
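One hedged option along those lines (not something used in the answer above): psycopg2 exposes the server-side backend PID of a connection, which on Postgres can be joined against pg_stat_activity to inspect that session's state and current query:
import psycopg2

# Illustrative DSN; replace with real connection parameters.
conn = psycopg2.connect("dbname=mydb user=me host=localhost")
pid = conn.get_backend_pid()  # server-side PID of this connection

with conn.cursor() as cur:
    cur.execute(
        "SELECT state, query FROM pg_stat_activity WHERE pid = %s",
        (pid,),
    )
    print(cur.fetchone())

conn.close()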

How to read documents from Change Feed in Azure Cosmos DB since last checkpoint after restart?

I am using the Change Feed processor library to read the Change Feed on a partitioned collection, and below is the code for how I have configured it. I am using most of the default options.
ChangeFeedProcessorOptions feedProcessorOptions = new ChangeFeedProcessorOptions
{
    LeaseRenewInterval = TimeSpan.FromSeconds(15),
};

var docObserverFactory = DocumentFeedObserverFactory.Create(this.destinationCollectionInfo, this.dbRepository);

this.builder
    .WithHostName(hostName)
    .WithFeedCollection(this.monitoredCollectionInfo)
    .WithLeaseCollection(this.leaseCollectionInfo)
    .WithProcessorOptions(feedProcessorOptions)
    .WithObserverFactory(docObserverFactory);
This runs fine as long as the Change Feed application is running and documents are being inserted/updated in the collection and the Change Feed app picks them up as expected.
The problem happens when I stop the Change Feed app for some time and insert/update a few documents in the collection. When I then start the Change Feed app, it doesn't pick up changes from where it last left off; the changes that were inserted while the Change Feed app was stopped are lost. But when I set the flag StartFromBeginning to true, it picks up everything from the start, including changes that were inserted while the Change Feed app was stopped.
My understanding of reading from the current point (StartFromBeginning set to false) is that the Change Feed reads documents from where it last left off. But that doesn't seem to happen. Please help.
There are two ways to continue from exactly where you left it.
The first, and more accurate, one is to store the continuation token of the last thing you read. That way you can specify it when you start again, and it will win over both the StartTime and the StartFromBeginning flags.
The second one is to provide the StartTime property, which will try to find the continuation token for a given time automatically. It has approximately 5-second precision, so there is a chance that you might miss some documents.

How to check which SQL query is so CPU intensive

Is there any possible way to check which query is so CPU intensive in the _sqlsrv2 process?
Something that gives me information about the query being executed in that process at that moment.
Is there any way to terminate that query without killing the _sqlsrv2 process?
I cannot find any official materials on this subject.
Thank you for any help.
You could look into client database-request caching.
The code examples below assume you have ABL access to the environment. If not, you will have to use SQL instead, but it shouldn't be too hard to "translate" the code below.
I haven't used this a lot myself but I wouldn't be surprised if it has some impact on performance.
You need to start caching in the active connection. This can be done in the connection itself or remotely via VST tables (as long as your remote session is connected to the same database) so you need to be able to identify your connections. This can be done via the process ID.
Generally how to enable the caching:
/* "_myconnection" is your current connection. You shouldn't do this */
FIND _myconnection NO-LOCK.
FIND _connect WHERE _connect-usr = _myconnection._MyConn-userid.
/* Start caching */
_connect._Connect-CachingType = 3.
DISPLAY _connect WITH FRAME x1 SIDE-LABELS WIDTH 100 1 COLUMN.
/* End caching */
_connect._Connect-CachingType = 0.
You need to identify your process first, via top or another program.
Then you can do something like:
/* Assuming pid 21966 */
FIND FIRST _connect NO-LOCK WHERE _Connect._Connect-Pid = 21966 NO-ERROR.
IF AVAILABLE _Connect THEN
DISPLAY _connect.
You could also look at the _Connect-Type. It should be 'SQLC' for SQL connections.
FOR EACH _Connect NO-LOCK WHERE _Connect._connect-type = "SQLC":
DISPLAY _connect._connect-type.
END.
Best of all would be to do this in a separate environment. If you can't at least try it in a test environment first.
Here's a good guide.
You can use a Select like this:
select
    c."_Connect-type",
    c."_Connect-PID" as 'PID',
    c."_connect-ipaddress" as 'IP',
    c."_Connect-CacheInfo"
from
    pub."_connect" c
where
    c."_Connect-CacheInfo" is not null
But first you need to enable the connection cache; see the example above.

Biztalk Dehydrating and Hydrating Loop

I have a simple BizTalk 2013 R2 application that imports a file into a table and then executes a long-running post-import process (via stored procedures).
Symptoms: when importing 2 files
The import of the first file has no issues
Then, the post processing starts (slow as expected due to long running stored procedure)
Then, if you drop a second file, the first file's post processing disappears and the second import takes place.
Then they start alternating back and forth (you can see the post processing field being populated as expected)
Both send ports are active, sometimes you see a third one dehydrated
Since there are no errors reported, is this a setting issue, or do I need to move the post processing out of the long running transaction?
Details:
Orchestration Transaction Type is long running
The time out for the post processing send port is 59 minutes
The post processing stored procedure invokes child stored procedures.
No errors are reported anywhere
Both send ports have ordered delivery checked
Post Processing Stored Procedures:
CREATE PROCEDURE [sync].[MPostProcessing]
    @Code NVARCHAR(2)
AS
    ----
    ---- 2. Normalize Address
    ----
    IF @Code = '99'
        EXEC sync.AElBatch @Code = @Code

CREATE PROCEDURE [sync].[AElBatch]
    @Code AS VARCHAR(2)
AS
    DECLARE @ID AS INT

    WHILE EXISTS ( SELECT ID
                   FROM sync.[mtable]
                   WHERE Code = @Code
                     AND PostProcessingDone = 0 )
    BEGIN
        SELECT TOP 1
            @ID = ID
        FROM sync.[mtable]
        WHERE Code = @Code
            AND PostProcessingDone = 0

        EXEC sync.PParse @ID = @ID

        UPDATE sync.[mtable]
        SET PostProcessingDone = 1
        WHERE Code = @Code
            AND ID = @ID
    END
And then the PParse stored procedure does more (all working, no errors reported).
[Image: BizTalk Server Administration Console]
So this is too long for a comment, but I'm still not 100% sure of your problem. Either way:
It seems like you likely have some issues with your SPs. Refactor them to use set based queries instead of while loops (or cursors if you have any). Forcing SQL Server to process each individual scalar variable as a separate call will prevent it from fully optimizing whatever it's doing in sync.PParse - pass a table variable to it or something if you need to so that it can parallelize it properly and stop holding things up so badly.
It's quite possible that sync.PParse has a bug in it that is reading data it shouldn't. These lines in particular from AElBatch are troubling:
SELECT TOP 1
    @ID = ID
FROM sync.[mtable]
WHERE Code = @Code
    AND PostProcessingDone = 0
You probably want to add a batch identifier in there of some sort so that PostProcessing#2 doesn't start picking up what was really meant for PostProcessing#1.
Double check what's going on with sp_who2, see if things are getting blocked. It's likely that something is going on there, even if no errors are surfacing properly.
In the end, if none of that works, you might have to make them into a single SP that BizTalk calls so that Ordered Delivery will keep both jobs in the same queue - rather than allowing File Load #2 to complete before post processing job #1 is done.
