Dagster chaining resources - dagster

I've recently picked up Dagster to evaluate as an alternative to Airflow.
I haven't been able to wrap my head around the concept of resources, and I'm looking to understand whether what I'm trying to do is possible or could be achieved better in a different way.
I have a helper class like the one below that helps keep code DRY:
from dagster import resource, solid, ModeDefinition, pipeline
from dagster_aws.s3 import s3_resource
class HelperAwsS3:
    def __init__(self, s3_resource):
        self.s3_resource = s3_resource
    def s3_list_bucket(self, bucket, prefix):
        return self.s3_resource.list_objects_v2(
            Bucket=bucket,
            Prefix=prefix
        )
    def s3_download_file(self, bucket, file, local_path):
        self.s3_resource.meta.client.download_file(
            Bucket=bucket,
            Key=file,
            Filename=local_path
        )
    def s3_upload_file(self, bucket, file, local_path):
        self.s3_resource.meta.client.upload_file(
            Bucket=bucket,
            Key=file,
            Filename=local_path
        )
The s3_resource is actually dagster_aws.s3.s3_resource, which will help me connect to AWS using my local AWS credentials.
I am not sure how to pass the s3_resource to HelperAwsS3 when I make the call in the @resource section below.
@resource
def connection_helper_aws_s3_resource(context):
    return HelperAwsS3()
Any pointers please? Or am I doing it all wrong and it needs doing in a different way?
Thanks for your help.

I posted the same question on the Dagster Slack channel and quickly had a reply from the helpful team. Posting it here in case it helps someone.
Keep your HelperAwsS3 class and write your own resource that uses the s3 resource. It could look something like this:
@resource(required_resource_keys={"s3"})
def connection_helper_aws_s3_resource(context):
    return HelperAwsS3(s3_resource=context.resources.s3)
Then be sure to include both the s3 resource and your custom resource in your mode definition:
@pipeline(mode_defs=[ModeDefinition(
    resource_defs={"s3": s3_resource, "connection_helper_aws_s3": connection_helper_aws_s3_resource}
)])
def my_pipeline():  # hypothetical pipeline name
    ...
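The injection pattern in the answer can be tried out without Dagster or AWS. The sketch below is only an illustration: FakeS3Client and the hand-built context are hypothetical stand-ins (not part of dagster or dagster_aws), and the resource function is shown without its decorator so it can be called directly.

```python
from types import SimpleNamespace


class HelperAwsS3:
    def __init__(self, s3_resource):
        self.s3_resource = s3_resource

    def s3_list_bucket(self, bucket, prefix):
        return self.s3_resource.list_objects_v2(Bucket=bucket, Prefix=prefix)


class FakeS3Client:
    """Hypothetical stand-in for the object the s3 resource provides."""

    def list_objects_v2(self, Bucket, Prefix):
        return {"Contents": [{"Key": Prefix + "example.csv"}], "Bucket": Bucket}


def connection_helper_aws_s3_resource(context):
    # Same body as the resource function in the answer, minus the decorator.
    return HelperAwsS3(s3_resource=context.resources.s3)


# Hand-built context that mimics the context.resources.s3 attribute access.
fake_context = SimpleNamespace(resources=SimpleNamespace(s3=FakeS3Client()))
helper = connection_helper_aws_s3_resource(fake_context)
listing = helper.s3_list_bucket("my-bucket", "data/")
print(listing["Contents"][0]["Key"])
```

Because the helper only depends on whatever object is passed to its constructor, the same class can be unit-tested with a stub and wired to the real s3 resource in the mode definition.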

Related

How to allow the user to override a subset of the configuration using their own yaml file?

Let's say I have this basic app:
from dataclasses import dataclass
import hydra
from hydra.core.config_store import ConfigStore
@dataclass
class MyAppConfig:
    req_int: int
    opt_str: str = "Default String"
    opt_float: float = 3.14
cs = ConfigStore.instance()
# Registering the config class with the name 'base_config'.
cs.store(name="base_config", node=MyAppConfig)
@hydra.main(version_base=None, config_name="base_config", config_path="conf")
def my_app(cfg: MyAppConfig) -> None:
    print(cfg)
if __name__ == "__main__":
    my_app()
Is it possible for the user to be able to call my app like this:
python my_app.py req_int=42 --config="~/path/to/user-defined-config.yaml"
And user-defined-config.yaml would contain only this:
opt_str: User Config String
The output should look like this:
{'req_int': 42, 'opt_str': 'User Config String', 'opt_float': 3.14, 'config': 'hydra-user-conf'}
The closest I got to that is:
user-defined-config.yaml
defaults:
- base_config
- _self_
opt_str: User Config String
And the invocation:
python hydra/app.py req_int=42 --config-path='~/path/to' --config-name="hydra-user-conf"
But this way the user (who I don't want to require to be familiar with hydra) has to specify the path to their config file via two cli arguments and also include the defaults section in their config, which would be redundant boilerplate to them if they have to always include it in all of their configuration files.
Is this the closest I can get with hydra to the desired interface?
One thing you can do is pre-configure the config searchpath in the primary config, adding something like ~/.my_app/ to it (thus potentially eliminating the need for --config-path|-cp).
In YAML it would look like:
hydra:
  searchpath:
    - file://${oc.env:HOME}/.my_app
Another thing to consider is having the app generate an initial config for the user on demand. I took this approach with Configen.
In general, the current patterns are not amazing and maybe there is room for some improvements in Hydra to make this more ergonomic (You can open a discussion about it).

Access global variable from Robot Framework listener 3

I am not able to access a global variable from a Robot Framework listener that uses listener API version 3.
I tried to access the global variable with the help of BuiltIn() as below, but this is not working.
Please share any ideas you have for doing this.
from robot.libraries.BuiltIn import BuiltIn
ROBOT_LISTENER_API_VERSION = 3
def end_test(test, result):
    print("BROWSER = '%s'" % BuiltIn().get_variables()['${BROWSER}'])
I think it should work like this:
BuiltIn().get_variable_value("${BROWSER}")

Import custom python modules into dag file without mixing dag environs and sys.path?

Is there any way to import custom python modules into dag file without mixing dag environs and sys.path? Can't use something like
environ["PROJECT_HOME"] = "/path/to/some/project/files"
# import certain project files
sys.path.append(environ["PROJECT_HOME"])
import mymodule
because sys.path is shared among all DAGs, and this causes problems (e.g. sharing of values between DAG definitions) if I want to import modules from different places that have the same name for different DAG definitions (and if there are many DAGs, this is hard to keep track of).
The docs for using packaged DAGs (which seemed like a solution) do not seem to avoid the problem:
the zip file will be inserted at the beginning of module search list (sys.path) and as such it will be available to any other code that resides within the same interpreter.
Anyone with more airflow knowledge know how to handle this kind of situation?
* Differs from the linked-to question in that it is less specific about implementation.
Ended up doing something like this:
if os.path.isfile("%s/path/to/specific/module/%s.py" % (PROJECT_HOME, file_name)):
    import imp
    f = imp.load_source("custom_module", "%s/path/to/specific/module/%s.py" % (PROJECT_HOME, file_name))
    df = f.myfunc(sparkSession, df)
This gets the needed module file explicitly from known paths, based on the SO post here.
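The imp module used above is deprecated and was removed in Python 3.12. A minimal sketch of the same idea with importlib.util follows; the helper name load_module_from_path, the directory names, and the generated mymodule.py files are hypothetical and only serve the demonstration.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Load a module from an explicit file path without touching sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Demo: two files with the same name in different directories stay separate.
with tempfile.TemporaryDirectory() as tmp:
    loaded = []
    for project, value in (("project_a", 1), ("project_b", 2)):
        os.makedirs(os.path.join(tmp, project))
        path = os.path.join(tmp, project, "mymodule.py")
        with open(path, "w") as fh:
            fh.write("def myfunc():\n    return %d\n" % value)
        # A unique module name per project avoids clashes between same-named files.
        loaded.append(load_module_from_path("mymodule_" + project, path))

results = [m.myfunc() for m in loaded]
print(results)
```

Since the modules are neither added to sys.path nor registered in sys.modules here, each DAG file that loads its own copy gets an independent module object.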

Scrapy spider will not crawl on start urls

I am brand new to Scrapy and have worked my way through the tutorial, and I am trying to figure out how to apply what I have learned so far to complete a seemingly basic task. I know very little Python so far and am using this as a learning experience, so I apologize if I ask a simple question.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a csv file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesn't crawl any web page that I enter. There are no errors listed when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is OK, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]
    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
Thank you for any help, I greatly appreciate it!
First of all, I do not understand what you are doing in your for loop: once you have a selector, you do not need to run the absolute XPath against the whole HTML again.
Nevertheless, the interesting part is that the browser represents the table quite differently from the HTML that Scrapy downloads. If you look at the response in your parse method, you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as in your XPath), change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
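The tbody effect can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, and the markup is a hypothetical, simplified version of the downloaded page (the real page is more complex).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the downloaded page:
# the raw HTML has no <tbody>; browsers add one when building the DOM.
RAW_HTML = """
<html><body>
<table>
  <tr><th>Serial</th></tr>
  <tr><td>972498</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path copied from the browser's inspector: matches nothing in the raw markup.
with_tbody = root.findall("body/table[1]/tbody/tr[2]/td[1]")

# Same path without tbody: matches the first cell of the second row.
without_tbody = root.findall("body/table[1]/tr[2]/td[1]")

print(len(with_tbody))
print([cell.text for cell in without_tbody])
```

When an XPath copied from a browser returns nothing, checking the raw response for elements that only exist in the browser's DOM is a reasonable first step.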

Most Pythonic way to provide global configuration variables in config.py? [closed]

In my endless quest to over-complicate simple stuff, I am researching the most 'Pythonic' way to provide global configuration variables inside the typical 'config.py' found in Python egg packages.
The traditional way (aah, good ol' #define!) is as follows:
MYSQL_PORT = 3306
MYSQL_DATABASE = 'mydb'
MYSQL_DATABASE_TABLES = ['tb_users', 'tb_groups']
Therefore global variables are imported in one of the following ways:
from config import *
dbname = MYSQL_DATABASE
for table in MYSQL_DATABASE_TABLES:
    print(table)
or:
import config
dbname = config.MYSQL_DATABASE
assert(isinstance(config.MYSQL_PORT, int))
It makes sense, but sometimes can be a little messy, especially when you're trying to remember the names of certain variables. Besides, providing a 'configuration' object, with variables as attributes, might be more flexible. So, taking a lead from bpython config.py file, I came up with:
class Struct(object):
    def __init__(self, *args):
        self.__header__ = str(args[0]) if args else None
    def __repr__(self):
        if self.__header__ is None:
            return super(Struct, self).__repr__()
        return self.__header__
    def next(self):
        """ Fake iteration functionality.
        """
        raise StopIteration
    def __iter__(self):
        """ Fake iteration functionality.
        We skip magic attributes and Structs, and return the rest.
        """
        ks = self.__dict__.keys()
        for k in ks:
            # Test the attribute value, not the key, when skipping Structs.
            if not k.startswith('__') and not isinstance(getattr(self, k), Struct):
                yield getattr(self, k)
    def __len__(self):
        """ Don't count magic attributes or Structs.
        """
        ks = self.__dict__.keys()
        return len([k for k in ks if not k.startswith('__')
                    and not isinstance(getattr(self, k), Struct)])
and a 'config.py' that imports the class and reads as follows:
from _config import Struct as Section
mysql = Section("MySQL specific configuration")
mysql.user = 'root'
mysql.passwd = 'secret'  # 'pass' is a reserved word, so the attribute is named 'passwd'
mysql.host = 'localhost'
mysql.port = 3306
mysql.database = 'mydb'
mysql.tables = Section("Tables for 'mydb'")
mysql.tables.users = 'tb_users'
mysql.tables.groups = 'tb_groups'
and is used in this way:
from sqlalchemy import MetaData, Table
import config as CONFIG
assert(isinstance(CONFIG.mysql.port, int))
mdata = MetaData(
    "mysql://%s:%s@%s:%d/%s" % (
        CONFIG.mysql.user,
        CONFIG.mysql.passwd,
        CONFIG.mysql.host,
        CONFIG.mysql.port,
        CONFIG.mysql.database,
    )
)
tables = []
for name in CONFIG.mysql.tables:
    tables.append(Table(name, mdata, autoload=True))
Which seems a more readable, expressive and flexible way of storing and fetching global variables inside a package.
Lamest idea ever? What is the best practice for coping with these situations? What is your way of storing and fetching global names and variables inside your package?
How about just using the built-in types like this:
config = {
    "mysql": {
        "user": "root",
        "pass": "secret",
        "tables": {
            "users": "tb_users"
        }
        # etc
    }
}
You'd access the values as follows:
config["mysql"]["tables"]["users"]
If you are willing to sacrifice the potential to compute expressions inside your config tree, you could use YAML and end up with a more readable config file like this:
mysql:
  - user: root
  - pass: secret
  - tables:
    - users: tb_users
and use a library like PyYAML to conveniently parse and access the config file.
I like this solution for small applications:
class App:
    __conf = {
        "username": "",
        "password": "",
        "MYSQL_PORT": 3306,
        "MYSQL_DATABASE": 'mydb',
        "MYSQL_DATABASE_TABLES": ['tb_users', 'tb_groups']
    }
    __setters = ["username", "password"]
    @staticmethod
    def config(name):
        return App.__conf[name]
    @staticmethod
    def set(name, value):
        if name in App.__setters:
            App.__conf[name] = value
        else:
            raise NameError("Name not accepted in set() method")
And then usage is:
if __name__ == "__main__":
    # from config import App
    App.config("MYSQL_PORT")      # return 3306
    App.set("username", "hi")     # set new username value
    App.config("username")        # return "hi"
    App.set("MYSQL_PORT", "abc")  # this raises NameError
.. you should like it because:
uses class variables (no object to pass around/ no singleton required),
uses encapsulated built-in types and looks like (is) a method call on App,
has control over individual config immutability, mutable globals are the worst kind of globals.
promotes conventional and well named access / readability in your source code
is a simple class but enforces structured access; an alternative is to use @property, but that requires more variable handling code per item and is object-based.
requires minimal changes to add new config items and set its mutability.
--Edit--:
For large applications, storing values in a YAML (i.e. properties) file and reading that in as immutable data is a better approach (i.e. blubb/ohaal's answer).
For small applications, this solution above is simpler.
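As a rough illustration of "reading that in as immutable data", parsed configuration can be wrapped in read-only views using only the standard library. This is one possible sketch, not the approach from the referenced answer; the freeze helper and the sample data are hypothetical, and the dict stands in for whatever a YAML or JSON parser would return.

```python
from types import MappingProxyType


def freeze(value):
    """Recursively wrap dicts in read-only views and turn lists into tuples."""
    if isinstance(value, dict):
        return MappingProxyType({k: freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


# Hypothetical data, as it might come back from a YAML or JSON parser.
raw = {"mysql": {"port": 3306, "tables": ["tb_users", "tb_groups"]}}
CONFIG = freeze(raw)

port = CONFIG["mysql"]["port"]
try:
    CONFIG["mysql"]["port"] = 1234  # writes are rejected by the proxy
    rejected = False
except TypeError:
    rejected = True
print(port, rejected)
```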
How about using classes?
# config.py
class MYSQL:
    PORT = 3306
    DATABASE = 'mydb'
    DATABASE_TABLES = ['tb_users', 'tb_groups']
# main.py
from config import MYSQL
print(MYSQL.PORT) # 3306
Let's be honest, we should probably consider using a Python Software Foundation maintained library:
https://docs.python.org/3/library/configparser.html
Config example: (ini format, but JSON available)
[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes
[bitbucket.org]
User = hg
[topsecret.server.com]
Port = 50022
ForwardX11 = no
Code example:
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.read('example.ini')
>>> config['DEFAULT']['Compression']
'yes'
>>> config['DEFAULT'].getboolean('MyCompression', fallback=True) # get_or_else
Making it globally-accessible:
import configparser
class App:
    __conf = None
    @staticmethod
    def config():
        if App.__conf is None:  # Read only once, lazy.
            App.__conf = configparser.ConfigParser()
            App.__conf.read('example.ini')
        return App.__conf
if __name__ == '__main__':
    App.config()['DEFAULT']['MYSQL_PORT']
    # or, better:
    App.config().get(section='DEFAULT', option='MYSQL_PORT', fallback=3306)
....
Downsides:
Uncontrolled global mutable state.
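For readers who want to try the configparser calls without creating example.ini, here is a small self-contained sketch. The INI text is a hypothetical excerpt modelled on the example above, loaded with read_string instead of read.

```python
import configparser

# Hypothetical INI content; read_string avoids needing a file on disk.
INI_TEXT = """
[DEFAULT]
Compression = yes
CompressionLevel = 9

[topsecret.server.com]
Port = 50022
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# Values are stored as strings; typed getters convert on access.
level = config["DEFAULT"].getint("CompressionLevel")
compress = config["DEFAULT"].getboolean("Compression")

# Sections inherit from DEFAULT, and fallback covers missing options.
port = config.getint("topsecret.server.com", "Port")
inherited = config["topsecret.server.com"]["Compression"]
mysql_port = config.getint("DEFAULT", "MYSQL_PORT", fallback=3306)

print(level, compress, port, inherited, mysql_port)
```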
A small variation on Husky's idea that I use. Make a file called 'globals' (or whatever you like) and then define multiple classes in it, as such:
#globals.py
class dbinfo :      # for database globals
    username = 'abcd'
    password = 'xyz'
class runtime :
    debug = False
    output = 'stdio'
Then, if you have two code files c1.py and c2.py, both can have at the top
import globals as gl
Now all code can access and set values, as such:
gl.runtime.debug = False
print(gl.dbinfo.username)
People forget classes exist, even if no object is ever instantiated that is a member of that class. And variables in a class that aren't preceded by 'self.' are shared across all instances of the class, even if there are none. Once 'debug' is changed by any code, all other code sees the change.
By importing it as gl, you can have multiple such files and variables that lets you access and set values across code files, functions, etc., but with no danger of namespace collision.
This lacks some of the clever error checking of other approaches, but is simple and easy to follow.
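The sharing behaviour described above can be checked in a single file. This is a minimal sketch that reuses the runtime class name from the answer; the reader function is hypothetical. It also shows one caveat: assigning through an instance creates an instance attribute rather than changing the shared class attribute.

```python
class runtime:
    debug = False
    output = 'stdio'


def reader():
    # Reads the attribute from the class each time it is called.
    return runtime.debug


before = reader()
runtime.debug = True   # set once, anywhere in the program
after = reader()

# Caveat: assigning through an instance creates an instance attribute
# that shadows the class attribute instead of changing it.
inst = runtime()
inst.debug = False
still_shared = runtime.debug
print(before, after, still_shared)
```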
Similar to blubb's answer. I suggest building them with lambda functions to reduce code. Like this:
User = lambda passwd, hair, name: {'password':passwd, 'hair':hair, 'name':name}
#Col  Username  Password      Hair Color  Real Name
config = {'st3v3' : User('password',   'blonde', 'Steve Booker'),
          'blubb' : User('12345678',   'black',  'Bubb Ohaal'),
          'suprM' : User('kryptonite', 'black',  'Clark Kent'),
          #...
         }
#...
config['st3v3']['password'] #> password
config['blubb']['hair'] #> black
This does smell like you may want to make a class, though.
Or, as MarkM noted, you could use namedtuple
from collections import namedtuple
#...
User = namedtuple('User', ['password', 'hair', 'name'])
#Col  Username  Password      Hair Color  Real Name
config = {'st3v3' : User('password',   'blonde', 'Steve Booker'),
          'blubb' : User('12345678',   'black',  'Bubb Ohaal'),
          'suprM' : User('kryptonite', 'black',  'Clark Kent'),
          #...
         }
#...
config['st3v3'].password #> password
config['blubb'].hair #> black
I did that once. Ultimately I found my simplified basicconfig.py adequate for my needs. You can pass in a namespace with other objects for it to reference if you need to. You can also pass in additional defaults from your code. It also maps attribute and mapping style syntax to the same configuration object.
Please check out the IPython configuration system, implemented via traitlets, which provides the type enforcement you are doing manually.
Cut and pasted here to comply with SO guidelines for not just dropping links as the content of links changes over time.
traitlets documentation
Here are the main requirements we wanted our configuration system to have:
Support for hierarchical configuration information.
Full integration with command line option parsers. Often, you want to read a configuration file, but then override some of the values with command line options. Our configuration system automates this process and allows each command line option to be linked to a particular attribute in the configuration hierarchy that it will override.
Configuration files that are themselves valid Python code. This accomplishes many things. First, it becomes possible to put logic in your configuration files that sets attributes based on your operating system, network setup, Python version, etc. Second, Python has a super simple syntax for accessing hierarchical data structures, namely regular attribute access (Foo.Bar.Bam.name). Third, using Python makes it easy for users to import configuration attributes from one configuration file to another.
Fourth, even though Python is dynamically typed, it does have types that can be checked at runtime. Thus, a 1 in a config file is the integer ‘1’, while a '1' is a string.
A fully automated method for getting the configuration information to the classes that need it at runtime. Writing code that walks a configuration hierarchy to extract a particular attribute is painful. When you have complex configuration information with hundreds of attributes, this makes you want to cry.
Type checking and validation that doesn’t require the entire configuration hierarchy to be specified statically before runtime. Python is a very dynamic language and you don’t always know everything that needs to be configured when a program starts.
To achieve this they basically define 3 object classes and their relations to each other:
1) Configuration - basically a ChainMap / basic dict with some enhancements for merging.
2) Configurable - base class to subclass all things you'd wish to configure.
3) Application - object that is instantiated to perform a specific application function, or your main application for single purpose software.
In their words:
Application: Application
An application is a process that does a specific job. The most obvious application is the ipython command line program. Each application reads one or more configuration files and a single set of command line options and then produces a master configuration object for the application. This configuration object is then passed to the configurable objects that the application creates. These configurable objects implement the actual logic of the application and know how to configure themselves given the configuration object.
Applications always have a log attribute that is a configured Logger. This allows centralized logging configuration per-application.
Configurable: Configurable
A configurable is a regular Python class that serves as a base class for all main classes in an application. The Configurable base class is lightweight and only does one thing.
This Configurable is a subclass of HasTraits that knows how to configure itself. Class level traits with the metadata config=True become values that can be configured from the command line and configuration files.
Developers create Configurable subclasses that implement all of the logic in the application. Each of these subclasses has its own configuration information that controls how instances are created.

Resources