While running pySpark SQL pipelines via Airflow I am interested in getting out some business stats like:
source read count
target write count
sizes of DFs during processing
error records count
One idea is to push them directly to a metrics backend, so that they get consumed automatically by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about that in the docs.
Please post at least some pseudocode if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"


def your_python_operator(**context):
    [...]
    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except Exception:
        logging.exception("Caught exception during statistics logging")
    [...]
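A slightly fuller sketch of the same idea, assuming the counts are already available as plain integers. `emit_pipeline_stats`, `RecordingStats` and the metric suffixes are hypothetical names. The function accepts any client exposing `incr` and `gauge`, which is the interface `airflow.stats.Stats` offers, so it can be tried here with a small recording stand-in instead of a live metrics backend. Gauges are used for DataFrame sizes because they report a current value rather than a running total.

```python
import logging

PYSPARK_LOG_PREFIX = "airflow_pyspark"


class RecordingStats:
    """Stand-in for airflow.stats.Stats that records calls (illustration only)."""

    def __init__(self):
        self.calls = []

    def incr(self, stat, count=1):
        self.calls.append(("incr", stat, count))

    def gauge(self, stat, value):
        self.calls.append(("gauge", stat, value))


def emit_pipeline_stats(stats, counters, gauges):
    """Send counters via incr() and point-in-time values via gauge().

    Failures are logged and swallowed so that a metrics outage does not
    fail the pipeline task itself.
    """
    try:
        for name, count in counters.items():
            stats.incr(f"{PYSPARK_LOG_PREFIX}_{name}", count)
        for name, value in gauges.items():
            stats.gauge(f"{PYSPARK_LOG_PREFIX}_{name}", value)
    except Exception:
        logging.exception("Caught exception during statistics logging")


# Hypothetical numbers standing in for values computed by the Spark job.
recorder = RecordingStats()
emit_pipeline_stats(
    recorder,
    counters={"read_count": 1000, "write_count": 990, "error_count": 10},
    gauges={"df_rows_after_filter": 990},
)
for call in recorder.calls:
    print(call)
```

Inside a real task, passing `Stats` as the first argument should forward the same calls to whatever metrics backend Airflow is configured with.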
Consider this example - you need to load table1 from source database, do some generic transformations (like convert time zones for timestamped columns) and write resulting data into Snowflake. This is an easy one and can be implemented using 3 dagster ops.
Now, imagine you need to do the same thing but with 100s of tables. How would you do it with dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job, that will be executed 100 times? Can you throttle how many of these jobs will run at the same time?
You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
Create a configurable job (I think this is more likely what you want)
from dagster import job, op
import pandas as pd


@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()


@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table


@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...


@job
def configurable_etl():
    load_table(transform_table(extract_table()))


# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options (which are provided through the run config when you create a run), your job will operate on different source / destination tables.
The example shows explicitly running this job using python APIs, but if you're running it from Dagit, you'll also be able to input the yaml version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.
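For hundreds of tables, the run config itself can be generated rather than written by hand. The sketch below only builds dictionaries in the shape used by the configurable job. `build_run_config`, the `src_`/`dest_` naming rule and the table list are hypothetical, and each dictionary would be passed as `run_config` when launching one run of `configurable_etl`.

```python
def build_run_config(source_table, dest_table):
    """Return the run config for one run of the configurable job."""
    return {
        "ops": {
            "extract_table": {"config": {"table_name": source_table}},
            "load_table": {"config": {"table_name": dest_table}},
        }
    }


# Hypothetical table list; in practice this might come from a catalog query.
tables = ["foo", "bar", "baz"]

run_configs = [build_run_config(f"src_{name}", f"dest_{name}") for name in tables]

for cfg in run_configs:
    source = cfg["ops"]["extract_table"]["config"]["table_name"]
    dest = cfg["ops"]["load_table"]["config"]["table_name"]
    print(source, "->", dest)
```

Keeping the config builder separate from the job definition means the list of tables can grow without touching the ops.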
I was wondering if anyone can tell me how to get the action log of an instance using openstacksdk or novaclient. While getting the action log, I also want to get the flavor attached to the instance. See the attached picture please.
I actually got the action log using this novaclient module:
novaclient.v2.instance_action.InstanceAction
but it shows me very few details, without the flavor id that I need. It shows me only the following fields:
action, instance_uuid, message, project_id, request_id, start_time and user_id
I hope someone can tell me how to get it.
I don't think it is possible to get the flavor id from the action list / server event list.
OpenStack does not keep a database record of what each request did, or a historic record of the instance states. So you would need to resort to trawling the logs for the request-id ... which is OK for forensics, but does not scale. (And I don't know if the flavor is in the log messages.)
Of course, you could use the APIs (novaclient, openstacksdk) to get the current flavor for the instance, given its instance id. But that isn't exactly what you want.
It is possible to record historical information using Gnocchi + Ceilometer or similar, but you would need to have set this up already.
I use airflow python operators to execute sql queries against a redshift/postgres database. In order to debug, I'd like the DAG to return the results of the sql execution, similar to what you would see if executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the sql. Having this logged would be extremely helpful to confirm the parsed parameterized sql, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of airflow or the low level workings of the python DBAPI, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the sql and not the results. I have also attempted to create a logging cursor, which produces the sql, but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
sql = """
    INSERT INTO mytable (
        SELECT *
        FROM other_table
    );
"""
cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses postgres hook to do something in sql.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
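To get closer to the console-style summary asked for in the question, the rows-affected count and elapsed time can be logged alongside the statement. This is a minimal sketch using only the standard DB-API `rowcount` attribute. `execute_and_log` is a hypothetical helper, and an in-memory `sqlite3` database stands in for Redshift/Postgres so that the example runs without a server. The same function should work with psycopg2 cursors, since they also expose `execute` and `rowcount`.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("sql_audit")


def execute_and_log(cur, sql, params=None):
    """Run one statement, then log the SQL, rows affected and elapsed time."""
    started = time.perf_counter()
    if params is None:
        cur.execute(sql)
    else:
        cur.execute(sql, params)
    elapsed_ms = (time.perf_counter() - started) * 1000
    logger.info("sql> %s", sql.strip())
    logger.info("%d rows affected in %.0f ms", cur.rowcount, elapsed_ms)
    return cur.rowcount


logging.basicConfig(level=logging.INFO)

# sqlite3 is only a stand-in here so that the sketch is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE other_table (id INTEGER)")
cur.execute("CREATE TABLE mytable (id INTEGER)")
cur.executemany("INSERT INTO other_table VALUES (?)", [(1,), (2,), (3,)])

inserted = execute_and_log(cur, "INSERT INTO mytable SELECT * FROM other_table")
conn.commit()
conn.close()
```

Inside an operator, `self.log` could replace the module-level logger so that the lines land in the task log.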
Ok, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in context with transactions executed on the server. In order for me to do this, I had to specifically set con.autocommit = False, and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you insert cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
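import logging

import psycopg2 as psy

con = psy.connect(...)
con.autocommit = False
cur = con.cursor()
try:
    cur.execute(some_sql)  # some_sql: the SQL block to run
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()  # autocommit is off, so commit explicitly
except Exception:
    con.rollback()
    raise
finally:
    con.close()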
There is some buried functionality within psycopg2 that I'm sure can be utilized, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to use things such as logobjects, or on returning the join PID to retrieve additional information, please share them.
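Because `statusmessage` is just the command tag sent back by the server, the row count can be pulled out of it for structured logging. A minimal sketch, assuming the tag ends with the row count for row-affecting commands (as in 'INSERT 0 92380'); `parse_status_message` is a hypothetical helper, not part of psycopg2.

```python
def parse_status_message(message):
    """Split a command tag such as 'INSERT 0 92380' into (command, rows).

    rows is None for tags that carry no count, e.g. 'BEGIN' or 'CREATE TABLE',
    and for an empty or missing message.
    """
    text = (message or "").strip()
    parts = text.split()
    if parts and parts[-1].isdigit():
        rows = int(parts[-1])
        # Drop any other numeric field (the extra number in INSERT tags).
        command = " ".join(p for p in parts[:-1] if not p.isdigit())
        return command, rows
    return text, None


for tag in ["INSERT 0 92380", "DELETE 17", "BEGIN", "CREATE TABLE"]:
    print(tag, "->", parse_status_message(tag))
```

The parsed count could then be logged or compared against an expected minimum to fail a task that inserted nothing.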
Suppose I have a metric named a.b.c.count. I am trying to write a python script which reads the latest value of the metric a.b.c.count in graphite.
I went through the docs and figured out that we can use curl to retrieve metrics from graphite using functions http://graphite.readthedocs.org/en/0.9.13-pre1/functions.html.
But I am still unable to figure out how to achieve this.
I haven't seen a way to ask Graphite for a single value, but you can ask for a summary of values over a configurable period, and take the last one. (This is just for minimizing the data returned; you could just as easily pull out the last value from any series in a given timeframe.) Example render parameters:
target=summarize(a.b.c.count,'1hour','last')&from=-1h&format=json
The JSON returned will look like this:
[{"target": "summarize(a.b.c.count, \"1hour\", \"last\")",
"datapoints": [[5.1333330000000004, 1442160000],
[5.5499989999999997, 1442163600]]}]
Here is a Python snippet to retrieve and parse this, using the 'requests' HTTP library:
import requests

r = requests.get(
    "http://graphite.yourdomain.com/render/?"
    "target=summarize(a.b.c.count,'1hour','last')&from=-1h&format=json"
)
print(r.json()[0]["datapoints"][-1][0])
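One caveat with taking the last element: Graphite reports intervals that have no data as null, so the final datapoint may be `None` after JSON decoding. A small sketch that walks backwards to the most recent non-null value; `latest_value` is a hypothetical helper, and the payload is the sample response with one null datapoint appended for illustration.

```python
import json


def latest_value(series):
    """Return (value, timestamp) of the newest non-null datapoint, or None."""
    for value, timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value, timestamp
    return None


# Raw string so the escaped quotes inside the JSON survive unchanged.
payload = json.loads(r"""
[{"target": "summarize(a.b.c.count, \"1hour\", \"last\")",
  "datapoints": [[5.1333330000000004, 1442160000],
                 [5.5499989999999997, 1442163600],
                 [null, 1442167200]]}]
""")

print(latest_value(payload[0]))
```

With a live server, `r.json()[0]` from the request would take the place of `payload[0]`.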
I have a situation where I have a CrawlSpider that searches for results using postal codes and categories (POST data). I need to get all the results for all the categories in all postal codes. My spider takes a postal code and a category as arguments for the POST data. I want to programmatically start a spider for each postal code/category combo via a script.
The documentation explains that you can run multiple spiders per process, with a code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process This is along the lines of what I want to do; however, I want to essentially queue up spiders so that each one runs after the preceding spider finishes.
Any ideas on how to accomplish this? There seem to be some answers that apply to older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).
You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is the sample code (not tested) based on this answer and adapted for your use case:
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']


def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach spider
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)


# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()


settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
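The sample above queues postal codes only and, because it uses `pop()`, works through them from the end of the list. A sketch of the queueing logic on its own, extended to postal code/category pairs and processed in order; it uses no Scrapy code. `build_job_queue`, `make_chain`, the category names and the fake `start_job` hook are all hypothetical, and the direct call to `start_next()` stands in for the `spider_closed` signal firing.

```python
from collections import deque
from itertools import product


def build_job_queue(postal_codes, categories):
    """Return a FIFO queue of (postal_code, category) pairs, one per spider run."""
    return deque(product(postal_codes, categories))


def make_chain(jobs, start_job, on_empty):
    """Return a callback that starts the next queued job, or calls on_empty().

    start_job(postal_code, category) would configure and launch one spider;
    on_empty() would stop the reactor.
    """
    def start_next():
        if jobs:
            start_job(*jobs.popleft())
        else:
            on_empty()
    return start_next


started = []
finished = []


def fake_start_job(postal_code, category):
    started.append((postal_code, category))
    start_next()  # stands in for the spider_closed signal firing


jobs = build_job_queue(['10801', '10802'], ['plumbers', 'dentists'])
start_next = make_chain(jobs, fake_start_job, lambda: finished.append(True))
start_next()

print(started)
```

In the real script, `start_next` would be connected as the `spider_closed` handler and `start_job` would wrap the spider setup.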