Is it possible to create dynamic jobs with Dagster?

Consider this example - you need to load table1 from source database, do some generic transformations (like convert time zones for timestamped columns) and write resulting data into Snowflake. This is an easy one and can be implemented using 3 dagster ops.
Now, imagine you need to do the same thing but with 100s of tables. How would you do it with dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job, that will be executed 100 times? Can you throttle how many of these jobs will run at the same time?

You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
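A minimal sketch of that first option (the table list and op bodies here are placeholders, and the op/job names are hypothetical):

from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def table_names():
    # in reality this might come from config or an information_schema query
    for name in ["table1", "table2", "table3"]:
        yield DynamicOutput(name, mapping_key=name)

@op
def etl_single_table(table_name: str):
    # extract from the source, transform, and load into Snowflake for one table
    ...

@job
def dynamic_etl():
    table_names().map(etl_single_table)

With the default multiprocess executor, step concurrency can then be capped through run config, along the lines of {"execution": {"config": {"multiprocess": {"max_concurrent": 4}}}}.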
Create a configurable job (I think this is more likely what you want)
from dagster import job, op, graph
import pandas as pd


@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()


@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table


@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...


@job
def configurable_etl():
    load_table(transform_table(extract_table()))


# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options (which are provided when you create a run through the run config), your job will operate on different source/destination tables.
The example shows explicitly running this job using the Python APIs, but if you're running it from Dagit, you'll also be able to input the YAML version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
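A rough config-mapping sketch, assuming the ops above and rewriting the composition as a graph so it can be built into a job with a mapped config (the graph/job names and the src_/dest_ prefix convention are hypothetical):

from dagster import config_mapping, graph

@graph
def etl_graph():
    load_table(transform_table(extract_table()))

@config_mapping(config_schema={"table_name": str})
def simplified_config(val):
    # expand a single top-level table_name into the per-op config shown earlier
    return {
        "ops": {
            "extract_table": {"config": {"table_name": f"src_{val['table_name']}"}},
            "load_table": {"config": {"table_name": f"dest_{val['table_name']}"}},
        }
    }

configurable_etl_job = etl_graph.to_job(config=simplified_config)

Runs of configurable_etl_job then only need something like run_config={"table_name": "foo"} instead of the nested per-op config.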
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.
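For example (a sketch; the tag key and value here are arbitrary), you could tag the job definition above and then reference the same tag in the QueuedRunCoordinator's tag_concurrency_limits setting in dagster.yaml:

from dagster import job

# the QueuedRunCoordinator can be configured in dagster.yaml with a
# tag_concurrency_limits entry for key "etl" / value "table_sync" and a limit,
# which caps how many runs carrying this tag execute at once
@job(tags={"etl": "table_sync"})
def configurable_etl():
    load_table(transform_table(extract_table()))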

Related

How can I log sql execution results in airflow?

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see if executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DBAPI, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the sql and not the results. I have also attempted to create a logging cursor, which produces the sql, but not the console results
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""
cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses the Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
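For example, a minimal sketch of that pattern as a PythonOperator callable using the Postgres hook (the connection id and query are placeholders, and it assumes the Postgres provider package is installed):

import logging

from airflow.providers.postgres.hooks.postgres import PostgresHook

def run_and_log(**context):
    hook = PostgresHook(postgres_conn_id="my_redshift")  # hypothetical connection id
    sql = "SELECT count(*) FROM mytable"                 # hypothetical sanity-check query
    logging.info("Executing:\n%s", sql)
    for row in hook.get_records(sql):                    # fetch the results so they land in the task log
        logging.info(row)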
OK, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in context with transactions executed on the server. In order for me to do this, I had to specifically set con.autocommit = False, and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you insert cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is a much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
import logging
import psycopg2 as psy

con = psy.connect(...)
con.autocommit = False
cur = con.cursor()
try:
    cur.execute([some_sql])
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()  # autocommit is off, so commit explicitly
except:
    con.rollback()
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure could be utilized, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to use things such as log objects, or how to use the connection PID to retrieve additional information, please share.

Airflow Custom Metrics and/or Result Object with custom fields

While running pySpark SQL pipelines via Airflow I am interested in getting out some business stats like:
source read count
target write count
sizes of DFs during processing
error records count
One idea is to push it directly to the metrics, so it will get automatically consumed by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about it in the docs.
Please post at least some pseudocode if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"


def your_python_operator(**context):
    [...]
    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except:
        logging.exception("Caught exception during statistics logging")
    [...]
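Note that these counters only reach your monitoring stack if statsd metrics are enabled in airflow.cfg (statsd_on plus the statsd host/port/prefix settings). Wiring the callable into a DAG is then the usual PythonOperator setup, roughly as follows (a sketch assuming Airflow 2.x import paths and a dag object defined elsewhere):

from airflow.operators.python import PythonOperator

stats_task = PythonOperator(
    task_id="run_pyspark_pipeline",       # hypothetical task id
    python_callable=your_python_operator,
    dag=dag,
)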

Graphite Derivative shows no data

Using graphite/Grafana to record the sizes of all collections in a mongodb instance. I wrote a simple (WIP) python script to do so:
#!/usr/bin/python
from pymongo import MongoClient
import socket
import time

statsd_ip = '127.0.0.1'
statsd_port = 8125

# create a udp socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

client = MongoClient(host='12.34.56.78', port=12345)
db = client.my_DB

# get collection list each runtime
collections = db.collection_names()
sizes = {}

# main
while True:
    # get collection size per name
    for collection in collections:
        sizes[collection] = db.command('collstats', collection)['size']
    # write each size to statsd
    for size in sizes:
        MESSAGE = "collection_%s:%d|c" % (size, sizes[size])
        sock.sendto(MESSAGE.encode(), (statsd_ip, statsd_port))  # sendto needs bytes on Python 3
    time.sleep(60)
This properly shows all of my collection sizes in grafana. However, I want to get a rate of change on these sizes, so I build the following graphite query in grafana:
derivative(statsd.myHost.collection_myCollection)
And the graph shows up totally blank. Any ideas?
FOLLOW-UP: When selecting a time range greater than 24h, all data similarly disappears from the graph. Can't for the life of me figure out that one.
Update: This was due to the fact that my collectd was configured to send samples every second. The statsd plugin for collectd, however, was receiving data every 60 seconds, so I ended up with None for most data points.
I discovered this by checking the raw data in Graphite by appending &format=raw to the end of a graphite-api query in a browser, which gives you the value of each data point as a comma-separated list.
The temporary fix for this was to surround the graphite query with keepLastValue(60). This however creates a stair-step graph, as the value for each None (60 values) becomes the last valid value within 60 steps. Graphing a derivative of this then becomes a widely spaced sawtooth graph.
In order to fix this, I will probably go on to fix the flush interval on collectd or switch to a standalone statsd instance and configure as necessary from there.

Biztalk Dehydrating and Hydrating Loop

I have a simple BizTalk 2013 R2 application that imports a file into a table, then executes a long-running post-import process (via stored procedures).
Symptoms when importing 2 files:
The import of the first file has no issues
Then, the post processing starts (slow as expected due to long running stored procedure)
Then, if you drop a second file, the first file's post processing disappears and the second import takes place.
Then they start alternating back and forth (you can see the post processing field being populated as expected)
Both send ports are active, sometimes you see a third one dehydrated
Since there are no errors reported, is this a matter of a setting, or do I need to move the post processing out of the long-running transaction?
Details:
Orchestration Transaction Type is long running
The time out for the post processing send port is 59 minutes
The post processing stored procedure invokes child stored procedures.
No errors are reported anywhere
Both send ports have ordered delivery checked
Post Processing Stored Procedures:
CREATE PROCEDURE [sync].[MPostProcessing]
    @Code NVARCHAR(2)
AS
    ----
    ---- 2. Normalize Address
    ----
    IF @Code = '99'
        EXEC sync.AElBatch @Code = @Code

CREATE PROCEDURE [sync].[AElBatch]
    @Code AS VARCHAR(2)
AS
    DECLARE @ID AS INT

    WHILE EXISTS ( SELECT ID
                   FROM sync.[mtable]
                   WHERE Code = @Code
                     AND PostProcessingDone = 0 )
    BEGIN
        SELECT TOP 1
            @ID = ID
        FROM sync.[mtable]
        WHERE Code = @Code
            AND PostProcessingDone = 0

        EXEC sync.PParse @ID = @ID

        UPDATE sync.[mtable]
        SET PostProcessingDone = 1
        WHERE Code = @Code
            AND ID = @ID
    END
And then the PParse stored procedure does more (all working, no errors reported).
(Image of the BizTalk Server Administration Console)
So this is too long for a comment but I'm not 100% sure of your problem still. Either way:
It seems like you likely have some issues with your SPs. Refactor them to use set based queries instead of while loops (or cursors if you have any). Forcing SQL Server to process each individual scalar variable as a separate call will prevent it from fully optimizing whatever it's doing in sync.PParse - pass a table variable to it or something if you need to so that it can parallelize it properly and stop holding things up so badly.
It's quite possible that sync.PParse has a bug in it that is reading data it shouldn't. These lines in particular from AElBatch are troubling:
SELECT TOP 1
    @ID = ID
FROM sync.[mtable]
WHERE Code = @Code
    AND PostProcessingDone = 0
You probably want to add a batch identifier in there of some sort so that PostProcessing#2 doesn't start picking up what was really meant for PostProcessing#1.
Double check what's going on with sp_who2, see if things are getting blocked. It's likely that something is going on there, even if no errors are surfacing properly.
In the end, if none of that works, you might have to make them into a single SP that BizTalk calls so that Ordered Delivery will keep both jobs in the same queue - rather than allowing File Load #2 to complete before post processing job #1 is done.

How to correctly unit test (using nose) a sqlalchemy Model by creating a new database

I currently have an application using flask-sqlalchemy. My model is connected to a PostgreSQL database, and now I would like to write unit tests (using nose). I was told to use SQLite to create a new database for testing, and after a lot of searching (and looking at the testing section on the flask-sqlalchemy website) I'm still confused as to how to do it. Each class in my model.py looks something like the following:
db = SQLAlchemy(app)

class Prod(db.Model):
    __tablename__ = 'prod'

    id = db.Column(db.Integer, primary_key=True)
    desc = db.Column(db.String)

    def __init__(self, id, desc):
        self.id = id
        self.desc = desc

My config.py:

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'postgres://name:pass@server/db'
and I would like to test my insert functions in a new file by setting up and tearing down a new database for each test. If anyone can give me some example code that would be great. Thanks!
I can't answer your specific question, but I will provide some general advice:
You will find that setting up and tearing down the complete database for each test will be too slow. Imagine in the future when you might have hundreds of tests or even thousands.
The approach we take is:
For testing purposes we have a database populated with test data. We have a script which creates a fresh database and populates it with this test data.
We run this script prior to running our test suite. All tests can assume this data exists.
Each test may create additional records if necessary, but it is responsible for undoing any changes it makes (deleting new records, reverting updates), in other words for leaving the database in the same state as it was before the test began; a sketch of this pattern follows below. This prevents tests from interfering with each other.
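A rough sketch of that pattern with nose, using the Prod model from the question (assumptions: the shared test database described above already exists, the module path is hypothetical, the id is simply chosen not to clash with seed data, and depending on your Flask-SQLAlchemy version you may need to push an application context first):

from nose.tools import assert_equal

from myapp.models import db, Prod   # hypothetical module path


class TestProdInsert(object):
    def setup(self):
        self.prod = Prod(id=999999, desc="test product")
        db.session.add(self.prod)
        db.session.commit()

    def teardown(self):
        # leave the database exactly as we found it
        db.session.delete(self.prod)
        db.session.commit()

    def test_insert(self):
        assert_equal(Prod.query.get(999999).desc, "test product")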
In a project I manage we have a test suite of 1070 tests which runs in about 5 minutes using this approach.
What if we had taken your approach? Let's assume that 50% of these tests actually exercise the database (and need a fresh reload). That's 1070 * .50 * 20 seconds for the reload / 3600 = 2.97 hours. Oops - that's far too slow to be useful.
Even at a much smaller scale though, you'll be much happier if your test suite runs in 1 minute instead of 20 minutes.
