I'm new to Airflow. Can someone please help me with this? I'm unable to access 'db_conn' inside my custom operator; the argument is defined in default_args.
**Dag details:**
default_args = {
'owner': 'airflow',
'email': ['example@hotmail.com'],
'db_conn': 'database_connection'
}
dag = DAG(dag_id='my_custom_operator_dag', description='Another tutorial DAG',
schedule_interval='0 12 * * *',
start_date=datetime(2020, 8, 6),
catchup=False,
default_args=default_args)
operator_task_start = MyOperator(
task_id='my_custom_operator', dag=dag
)
**Operator details:**
class MyOperator(BaseOperator):
@apply_defaults
def __init__(self,
*args,
**kwargs):
super(MyOperator, self).__init__(*args,**kwargs)
def execute(self, context):
log.info('owner: %s', self.owner)
log.info('email: %s', self.email)
log.info('db_conn: %s', self.db_conn)
# Error here: AttributeError: 'MyOperator' object has no attribute 'db_conn'
You seem to have misunderstood default_args. default_args is just a shorthand (for brevity / code cleanup / less repetition) for passing args that are common to all operators of a DAG (i.e. have the same value for every operator, like owner): you set them up as defaults once and pass them to the DAG itself. Quoting the docstring of the DAG params:
:param default_args: A dictionary of default parameters to be used
as constructor keyword parameters when initialising operators.
Note that operators have the same hook, and precede those defined
here, meaning that if your dict contains `'depends_on_past': True`
here and `'depends_on_past': False` in the operator's call
`default_args`, the actual value will be `False`.
:type default_args: dict
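For instance (illustrative only; the DAG and task here are made up to show the precedence rule from the docstring):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {'owner': 'airflow', 'depends_on_past': True}

dag = DAG(dag_id='precedence_example', default_args=default_args,
          start_date=datetime(2020, 8, 6), schedule_interval=None)

# depends_on_past=False passed here wins over the True coming from default_args,
# while owner is simply inherited from default_args.
task = DummyOperator(task_id='example', depends_on_past=False, dag=dag)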
So clearly, for default_args to work, any key you pass there must be an argument of your operator classes.
Not just that: note that passing invalid (non-existent) arguments to operator constructors is penalized in Airflow 2.0 (so better not to pass any):
'Invalid arguments were passed to {c} (task_id: {t}). '
'Support for passing such arguments will be dropped in '
'future. ..
Hopefully it is now clear that to make this work, you must add a db_conn param to the constructor of your MyOperator class.
**Operator details:**
class MyOperator(BaseOperator):
@apply_defaults
def __init__(self,
db_conn: str,
*args,
**kwargs):
super(MyOperator, self).__init__(*args,**kwargs)
self.db_conn: str = db_conn
And while we are at it, may I offer a suggestion: for something like a connection, preferably use the Connection feature offered by Airflow, which eases your interaction with external services:

- it makes them manageable (view / edit via the UI)
- secure (they are stored encrypted in the db)
- supports load balancing (define multiple connections with the same conn_id and Airflow will randomly distribute calls to one of them)

> When there is more than one connection with the same conn_id, the get_connection() method on BaseHook will choose one connection randomly. This can be used to provide basic load balancing and fault tolerance, when used in conjunction with retries.

- nice integration with built-in hooks

> They also use the airflow.models.connection.Connection model to retrieve hostnames and authentication information. Hooks keep authentication code and information out of pipelines, centralized in the metadata database.
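To make that concrete, here is a rough, untested sketch: instead of shipping a raw string through default_args, the operator takes a conn_id and resolves the Connection at runtime. The conn_id 'my_database_connection' is made up; create it first under Admin -> Connections.
from airflow.hooks.base_hook import BaseHook  # airflow.hooks.base in Airflow 2.x
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MyOperator(BaseOperator):

    @apply_defaults
    def __init__(self, db_conn_id: str = 'my_database_connection', *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.db_conn_id = db_conn_id

    def execute(self, context):
        # Resolve the Connection stored (encrypted) in Airflow's metadata db;
        # host / login / password / schema all come from the UI-managed record.
        conn = BaseHook.get_connection(self.db_conn_id)
        self.log.info('Connecting to %s as %s', conn.host, conn.login)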
FastAPI supports having some (predefined) classes as Pydantic model fields and having them converted to/from JSON. For example datetime:
class MyModel(pydantic.BaseModel):
created_at: datetime.datetime
Such a model converts datetime to str in the output JSON when used as a response model, and from str in the input JSON when used as a request body model.
I would like to have similar type safety for my own classes:
class MyModel(pydantic.BaseModel):
phone_number: phonenumbers.PhoneNumber
This can be made to work for request body models by using a custom validator but I also need MyModel to be convertible to JSON. Is this possible to achieve today? Note that I don't control the PhoneNumber class so the solution can't involve modifying that class.
Edit: the best I've come up with but still doesn't work:
def phone_number_validator(value: str) -> phonenumbers.PhoneNumber:
...
class MyModel(pydantic.BaseModel):
phone_number: phonenumbers.PhoneNumber
_validate_phone_number = pydantic.validator(
'phone_number', pre=True, allow_reuse=True)(phone_number_validator)
class Config:
arbitrary_types_allowed = True
json_encoders = {
phonenumbers.PhoneNumber: lambda p: phonenumbers.format_number(
p, phonenumbers.PhoneNumberFormat.E164),
}
This fails in FastAPI with:
fastapi.exceptions.FastAPIError: Invalid args for response field! Hint: check that <class 'phonenumbers.phonenumber.PhoneNumber'> is a valid pydantic field type
As you have already noticed, this is a bug in FastAPI. I just created a PR to fix it.
The arbitrary_types_allowed config directive is lost during the processing of the response model.
Until the PR is merged, you can use the workaround of monkey-patching the Pydantic BaseConfig like this:
from pydantic import BaseConfig
...
BaseConfig.arbitrary_types_allowed = True
# Your routes here:
...
But keep in mind that independent from this bug you might also need to adjust the JSON schema for the custom type, if you want the OpenAPI docs to work properly. Arbitrary types are not generally supported by the BaseModel.schema() method.
For that you can probably just inherit from phonenumbers.PhoneNumber and set a proper __modify_schema__ classmethod. See here for an example. Though I have not looked thoroughly into phonenumbers.
Check the example code in my PR text, if you want to see how you could implement validation and schema modification on your PhoneNumber subclass.
PS
Here is a full working example:
from __future__ import annotations
from typing import Union
from fastapi import FastAPI
from phonenumbers import PhoneNumber as _PhoneNumber
from phonenumbers import NumberParseException, PhoneNumberFormat
from phonenumbers import format_number, is_possible_number, parse
from pydantic import BaseModel, BaseConfig
class PhoneNumber(_PhoneNumber):
@classmethod
def __get_validators__(cls):
yield cls.validate
@classmethod
def validate(cls, v: Union[str, PhoneNumber]) -> PhoneNumber:
if isinstance(v, PhoneNumber):
return v
try:
number = parse(v, None)
except NumberParseException as ex:
raise ValueError(f'Invalid phone number: {v}') from ex
if not is_possible_number(number):
raise ValueError(f'Invalid phone number: {v}')
return number
@classmethod
def __modify_schema__(cls, field_schema: dict) -> None:
field_schema.update(
type="string",
# pattern='^SOMEPATTERN?$',
examples=["+49123456789"],
)
def json_encode(self) -> str:
return format_number(self, PhoneNumberFormat.E164)
class MyModel(BaseModel):
phone_number: PhoneNumber
class Config:
arbitrary_types_allowed = True
json_encoders = {
PhoneNumber: PhoneNumber.json_encode,
}
test_number = PhoneNumber(
country_code=49,
national_number=123456789
)
# Test:
obj = MyModel(phone_number=test_number)
obj_json = obj.json()
parsed_obj = MyModel.parse_raw(obj_json)
assert obj == parsed_obj
BaseConfig.arbitrary_types_allowed = True
api = FastAPI()
@api.get("/model/", response_model=MyModel)
def example_route():
return MyModel(phone_number=test_number)
I'm using Airflow in Google Composer. By default, all the tasks in a DAG use the default connection to communicate with Storage, BigQuery, etc. Obviously we can specify another connection configured in Airflow, i.e.:
task_custom = bigquery.BigQueryInsertJobOperator(
task_id='task_custom_connection',
gcp_conn_id='my_gcp_connection',
configuration={
"query": {
"query": 'SELECT 1',
"useLegacySql": False
}
}
)
Is it possible to use a specific connection as the default for all tasks in the entire DAG?
Thanks in advance.
UPDATE:
Specifying gcp_conn_id via default_args in the DAG (as Javier Lopez Tomas recommended) doesn't work completely. Operators that expect gcp_conn_id as a parameter work fine, but in my case, unfortunately, most interactions with GCP components happen via clients or hooks inside PythonOperators.
For example: if I call DataflowHook (inside a function called by a PythonOperator) without specifying the connection, it internally uses "google_cloud_default" and not my gcp_conn_id :(
def _dummy_func(**context):
df_hook = DataflowHook()
default_args = {
'gcp_conn_id': 'my_gcp_connection'
}
with DAG(default_args=default_args) as dag:
dummy = PythonOperator(python_callable=_dummy_func)
You can use default args:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#default-arguments
In your case it would be:
default_args = {
"gcp_conn_id": "my_gcp_connection"
}
with DAG(blabla)...
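Regarding the update about hooks inside PythonOperators: hooks instantiated in a callable don't look at default_args at all, so the connection has to be handed to them explicitly. A rough, untested sketch (the import path is the one from the Google providers package):
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

def _dummy_func(**context):
    # Read the conn id back from the DAG's default_args to keep a single
    # source of truth, then pass it explicitly to the hook.
    conn_id = context['dag'].default_args.get('gcp_conn_id', 'google_cloud_default')
    df_hook = DataflowHook(gcp_conn_id=conn_id)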
I'm learning Airflow and am planning to set some variables to use across different tasks. These are in my dags folder, saved as configs.json, like so:
{
"vars": {
"task1_args": {
"something": "This is task 1"
},
"task2_args": {
"something": "this is task 2"
}
}
}
I get that we can go to Admin --> Variables --> upload the file. But I have 2 questions:
What if I want to adjust some of the variables while Airflow is running? I can adjust my code easily and it updates in real time, but this doesn't seem to work for variables.
Is there a way to just auto-import this specific file on startup? I don't want to have to add it every time I'm testing my project.
I don't see this mentioned in the docs but it seems like a pretty trivial thing to want.
What you are looking for is covered in With code, how do you update an airflow variable?
Here's an untested snippet that should help
from airflow.models import Variable
Variable.set(key="my_key", value="my_value")
So basically you can write a bootstrap Python script to do this setup for you.
In our team, we use such scripts to set up all Connections and Pools too.
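For example, here is a rough, untested bootstrap for the configs.json from your question (the file path and the choice to store each entry under "vars" as its own Variable are assumptions):
import json

from airflow.models import Variable


def import_variables(file_path: str = "dags/configs.json") -> None:
    with open(file_path) as f:
        config = json.load(f)
    for key, value in config["vars"].items():
        # serialize_json=True stores the dict as a JSON string, so tasks can
        # read it back with Variable.get(key, deserialize_json=True)
        Variable.set(key=key, value=value, serialize_json=True)


if __name__ == "__main__":
    import_variables()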
In case you are wondering, here's the set(..) method from source
@classmethod
@provide_session
def set(
cls,
key: str,
value: Any,
serialize_json: bool = False,
session: Session = None
):
"""
Sets a value for an Airflow Variable with a given Key
:param key: Variable Key
:param value: Value to set for the Variable
:param serialize_json: Serialize the value to a JSON string
:param session: SQL Alchemy Sessions
"""
if serialize_json:
stored_value = json.dumps(value, indent=2)
else:
stored_value = str(value)
Variable.delete(key, session=session)
session.add(Variable(key=key, val=stored_value))
session.flush()
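As for your first question (adjusting variables while Airflow is running): a variable changed via the UI, CLI, or Variable.set is just read from the metadata db, so a task that calls Variable.get at execution time picks up the current value on its next run. A rough sketch, assuming the variables were stored as JSON as in the bootstrap above:
from airflow.models import Variable


def my_task(**context):
    # Read at execution time, so edits made in Admin -> Variables (or via
    # Variable.set) are reflected in the next task run.
    task1_args = Variable.get("task1_args", deserialize_json=True)
    print(task1_args["something"])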
Is it possible to pre-define variables, connections etc. in a file so that they are loaded when Airflow starts up? Setting them through the UI is not great from a deployment perspective.
Cheers
Terry
I'm glad that someone asked this question. In fact, since Airflow completely exposes the underlying SQLAlchemy models to the end user, programmatic manipulation (creation, update & deletion) of all Airflow models is possible, particularly of those used to supply configs like Connection & Variable.
It may not be very obvious, but the open-source nature of Airflow means there are no secrets: you just need to peek in harder. For these use-cases in particular, I've always found cli.py to be a very useful reference point.
So here's the snippet I use to create all MySQL connections while setting up Airflow. The input file supplied is of JSON format with the given structure.
# all imports
import json
from typing import List, Dict, Any, Optional
from airflow.models import Connection
from airflow.settings import Session
from airflow.utils.db import provide_session
from sqlalchemy.orm import exc
# trigger method
def create_mysql_conns(file_path: str) -> None:
"""
Reads MySQL connection settings from a given JSON file and
persists it in Airflow's meta-db. If connection for same
db already exists, it is overwritten
:param file_path: Path to JSON file containing MySQL connection settings
:type file_path: str
:return: None
:type: None
"""
with open(file_path) as json_file:
json_data: List[Dict[str, Any]] = json.load(json_file)
for settings_dict in json_data:
db_name: str = settings_dict["db"]
conn_id: str = "mysql.{db_name}".format(db_name=db_name)
mysql_conn: Connection = Connection(conn_id=conn_id,
conn_type="mysql",
host=settings_dict["host"],
login=settings_dict["user"],
password=settings_dict["password"],
schema=db_name,
port=settings_dict.get("port", 3306))  # 3306 = default MySQL port
create_and_overwrite_conn(conn=mysql_conn)
# utility delete method
@provide_session
def delete_conn_if_exists(conn_id: str, session: Optional[Session] = None) -> bool:
# Code snippet borrowed from airflow.bin.cli.connections(..)
try:
to_delete: Connection = (session
.query(Connection)
.filter(Connection.conn_id == conn_id)
.one())
except exc.NoResultFound:
return False
except exc.MultipleResultsFound:
return False
else:
session.delete(to_delete)
session.commit()
return True
# utility overwrite method
@provide_session
def create_and_overwrite_conn(conn: Connection, session: Optional[Session] = None) -> None:
delete_conn_if_exists(conn_id=conn.conn_id)
session.add(conn)
session.commit()
input JSON file structure
[
{
"db": "db_1",
"host": "db_1.hostname.com",
"user": "db_1_user",
"password": "db_1_passwd"
},
{
"db": "db_2",
"host": "db_2.hostname.com",
"user": "db_2_user",
"password": "db_2_passwd"
}
]
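Usage is then just a single call from your deployment / bootstrap step, for example (the file path is of course yours to choose):
# Upserts every connection defined in the JSON file into Airflow's meta-db.
create_mysql_conns("/path/to/mysql_connections.json")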
Reference links
With code, how do you update an airflow variable?
Problem updating the connections in Airflow programatically
How to create, update and delete airflow variables without using the GUI?
Programmatically clear the state of airflow task instances
airflow/api/common/experimental/pool.py
I have a job with 3 tasks
1) Get a token using a POST request
2) Get token value and store in a variable
3) Make a GET request using the token from step 2, passed as a bearer token
The issue is that step 3 is not working and I am getting an HTTP error. I was able to print the value of the token in step 2 and verify it in the code.
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': airflow.utils.dates.days_ago(2),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
token ="mytoken" //defined with some value which will be updated later
get_token = SimpleHttpOperator(
task_id='get_token',
method='POST',
headers={"Authorization": "Basic xxxxxxxxxxxxxxx=="},
endpoint='/token?username=user&password=pass&grant_type=password',
http_conn_id = 'test_http',
trigger_rule="all_done",
xcom_push=True,
dag=dag
)
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_token')
    print("printing token")
    print(value)
    wjdata = json.loads(value)
    print(wjdata['access_token'])
    token = wjdata['access_token']
    print(token)
run_this = PythonOperator(
task_id='print_the_context',
provide_context=True,
python_callable=pull_function,
dag=dag,
)
get_config = SimpleHttpOperator(
task_id='get_config',
method='GET',
headers={"Authorization": "Bearer " + token},
endpoint='someendpoint',
http_conn_id = 'test_conn',
trigger_rule="all_done",
xcom_push=True,
dag=dag
)
get_token >> run_this >> get_config
The way you are storing token as a "global" variable won't work. The Dag definition file (the script where you defined the tasks) is not the same runtime context as the one for executing each task. Every task can be run in a separate thread, process, or even on another machine, depending on the executor. The way you pass data between the tasks is not by global variables, but rather using the XCom - which you already partially do.
Try the following:
- remove the global token variable
- in pull_function, instead of print(token), do return token - this pushes the value to XCom so the next task can access it (see the sketch after this list)
- access the token from XCom in your next task.
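A rough, untested sketch of the second bullet, using the names from your DAG:
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_token')
    wjdata = json.loads(value)
    # Returning the token pushes it to XCom (under the default 'return_value'
    # key), so a downstream task can pull it via xcom_pull.
    return wjdata['access_token']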
The last step is a bit tricky since you are using the SimpleHttpOperator, and its only templated fields are endpoint and data, not headers.
For example, if you wanted to pass in some data from the XCom of a previous task, you would do something like this:
get_config = SimpleHttpOperator(
task_id='get_config',
endpoint='someendpoint',
http_conn_id = 'test_conn',
dag=dag,
data='{{ task_instance.xcom_pull(task_ids="print_the_context", key="some_key") }}'
)
But you can't do the same with the headers unfortunately, so you have to either do it "manually" via a PythonOperator, or you could inherit SimpleHttpOperator and create your own, something like:
class HeaderTemplatedHttpOperator(SimpleHttpOperator):
template_fields = ('endpoint', 'data', 'headers')  # added 'headers'
then use that one, something like:
get_config = HeaderTemplatedHttpOperator(
task_id='get_config',
endpoint='someendpoint',
http_conn_id = 'test_conn',
dag=dag,
headers='{{ task_instance.xcom_pull(task_ids="print_the_context") }}'
)
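An alternative, equally untested: Airflow renders templated fields recursively through dicts, so with 'headers' in template_fields you can keep it a real dict and template only the token (this assumes pull_function returns the bare token, as in the sketch above):
get_config = HeaderTemplatedHttpOperator(
    task_id='get_config',
    method='GET',
    endpoint='someendpoint',
    http_conn_id='test_conn',
    # Only the Jinja expression inside the value gets rendered at runtime.
    headers={
        'Authorization': 'Bearer {{ task_instance.xcom_pull(task_ids="print_the_context") }}',
    },
    dag=dag,
)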
Keep in mind I did no testing on this, it's just for the purpose of explaining the concept. Play around with the approach and you should get there.